Introduction to Genomics, Second Edition
Arthur M. Lesk
The Pennsylvania State University
Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
Oxford is a registered trade mark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
© Arthur M Lesk 2012
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First edition 2007
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press,
or as expressly permitted by law, or under terms agreed with the appropriate
reprographics rights organization. Enquiries concerning reproduction
outside the scope of the above should be sent to the Rights Department,
Oxford University Press, at the address above
You must not circulate this book in any other binding or cover
and you must impose the same condition on any acquirer
British Library Cataloguing in Publication Data
Data available
Library of Congress Cataloging in Publication Data
Data available
Typeset by Graphicraft Limited, Hong Kong
Printed in Italy on acid-free paper by L.E.G.O. S.p.A. – Lavis TN
ISBN 978–0–19–956435–4
10 9 8 7 6 5 4 3 2 1
For Victor and Valerie
PREFACE TO THE FIRST EDITION
Of all the claims on our curiosity, we want most to understand ourselves. What are we? What lies in our future? Many features of our lives depend on accidents of history. The time and place of our birth largely determine what language we first learn to speak and whether we are likely to be well-fed and well-educated and receive adequate medical care. Many aspects of our future depend on events outside ourselves and beyond our control.
Within ourselves, also, there are constraints on our lives that brook relatively little argument. In some respects, we are at the mercy of our genomes. Under normal circumstances, all of our basic anatomy and physiology, and eye colour, height, intelligence and basic personality traits, are ingrained in our DNA sequences. This is not to say that our genomes dictate our lives. Some of the constraints are tight – eye colour, for instance – but our genetic endowment also confers on us a remarkable robustness.
This robustness also is a product of evolution. When Shakespeare wrote of ‘the thousand natural shocks that flesh is heir to’, he coupled the challenges of life to heredity. Within the last century, lifestyles have changed with a rapidity hitherto unknown (except for the instants of asteroid impacts). Our talents have many opportunities to nurture themselves and develop in novel ways, and we can meet and survive brutal stresses. These are gifts of our genomic endowment: What genes control is the response of an organism to its environment.
The human genome is only one of the many complete genome sequences known. Taken together, genome sequences from organisms distributed widely among the branches of the tree of life give us a sense, only hinted at before, of the very great unity in detail of all life on Earth. This recognition has changed our perceptions, much as the first pictures of the Earth from space engendered a unified view of our planet.
Of course, superimposed on this basic unity is great variety. We ask: What is special about us? What do we share with our parents and siblings and how do we differ from them? What do we share with all other human beings and what makes us different from the other members of our species? What are the sources of our differences from our closest extant non-human relatives, the chimpanzees? What do we have in common and how do we diverge from other species of primates? of mammals? of vertebrates? of eukaryotes? of all other living things?
The complete sequences of human and other genomes give us complete information about the underlying text of this story. We are beginning to understand how our lives shape themselves under the influence of our genes plus our surroundings.
We are also beginning to intervene. Genetic engineering of microorganisms is an established technique. Genetically modified plants and animals exist and are the subjects of lively debate. To override the genes for hair colour is trivial. Changes in lifestyle or behaviour can – to some extent – avoid or postpone development of diseases to which we are genetically prone. Gene therapy offers the promise of rectifying some inborn defects.
In this book, we shall explore this new knowledge, what it tells us about ourselves and how we can apply it. With power derived from knowledge goes commitment to act wisely. We have responsibilities, to ourselves, to other people, to other species and to ecosystems ranging in size up to the entire biosphere.
Ethical, legal and social issues have been a prominent component of the human genome project. Most technical questions in genomics, as in other scientific subjects, have objectively correct answers. We do not know all the answers, but they are out there for us to discover. Ethical, legal and social issues are different. Many choices are possible. Their selection is not the privilege of scientists in individual laboratories, but of society as a whole. Scientists do have a responsibility to contribute to the informed public discussion that is essential for wise decisions.
One problem encountered in writing about genomics is the need to pick and choose from the many riches of the subject. The list of subjects that cannot be left out is too long and threatens to reduce the treatment of each to superficiality. There is also a serious organizational challenge: many phenomena must be approached from several different points of view. A reader may be relieved to conclude that a topic has been beaten thoroughly into submission in one chapter, only to encounter it again, alive and kicking, in a different context.
The speed at which the field is moving causes other problems. One is often pleased with a draft of a section only to find the carefully described conclusions modified in next week’s journals. Yet, there is a great pleasure in seeing Nature’s secrets emerging before one’s eyes.
Another casualty of rapid progress is a loss of interest in history and biography. We are fantastically interested in the development of the sea urchin and the fruit fly, but not at all in the development of molecular genomics. Intellectual struggles that occupied entire careers leave behind only terse conclusions, often without any appreciation of the experiments that established the facts, much less of the alternative hypotheses tested and rejected. The force of the scientists’ personalities, and their foibles, are forgotten. This is too bad: those who do not learn from the successes of history will find it harder to emulate them.
Genomics is an interdisciplinary subject. The phenomena we want to explain are biological. But many fields contribute to the methods and the intellectual approaches that we bring to bear on the data. Physicists, mathematicians, computer scientists, engineers, chemists, clinical practitioners and researchers, have all joined in the enterprise. This book will appeal, at least in part, to all of them. However, the central point of view remains focused on the biology.
More specifically, the focus is on human biology. In fact, on the biology of humans who are curious about other species, albeit primarily for what the other species tell us about ourselves. This choice naturally reflects the potential readership of this book. (If bacteria or fruit flies could read, genomics textbooks would look very different.)
This book assumes that the reader already has some acquaintance with modern molecular biology, and builds on and develops this background, as a self-contained presentation. It is suitable as a textbook for undergraduates or starting postgraduate students.
Exercises, problems and ‘weblems’ at ends of chapters test and consolidate understanding and provide opportunities to practise skills and explore additional subjects. Exercises are short and straightforward applications of material in the text. Answers to exercises appear on the web site associated with the book. Problems, also, make use of no information not contained in the text, but require lengthier answers or in some cases calculations. The third category, ‘weblems’, requires access to the World Wide Web. Weblems are designed to give readers practice with the tools required for further study and research in the field.
PREFACE TO THE SECOND EDITION
Fast, inexpensive sequencing has transformed genomics. The landmark goal, the $US1000 human genome, will likely soon be achieved. At the time of writing, thousands of human individuals have had their full genomes sequenced, and many more are on the way. A very large number of people have had sequences determined for individual genes. For example, mutations in BRCA1 suggest an increased likelihood of developing breast or ovarian cancer.
Genetic testing for disease is one of many possible fields of application of genomics. Clinical medicine heads the list; no one doubts that it was the promise of improvements in health that motivated support for the original human genome project. Understanding the relationships between genes and disease will allow more precise diagnosis and warnings of increased risk of disease in patients and their offspring. It will allow design of treatment tailored to the biochemical characteristics of the patient, called pharmacogenomics. Genomes of other organisms also have implications for human health, especially those of pathogenic organisms that have developed, or are threatening to develop, antibiotic resistance. Other applications of genomics include improvement of crops and domesticated animals, enhancing food production, and support of conservation efforts dedicated to preserving endangered species. Development of alternative energy sources is a challenge to both physics and biology.
Underlying these applications, genomics offers us a profound understanding of fundamental principles of biology. On the personal level, genome exegesis will fundamentally alter our perception of ourselves: what does it mean to be human? The answer lies somewhere in the complex interplay of our genes and life histories. For some characteristics, there is a simple answer. Your eye colour, and whether or not you suffer from sickle-cell anaemia, depend exclusively on the sequences of particular genes. For most of your phenotypic traits, however, the assignment of contributions to their origin from genome, epigenetics, and life history is a severe challenge.
In this book, I have tried to present a balanced view of the background of the subject, the technical developments that have so greatly increased the data flow, the current state of our knowledge and understanding of the data, and applications to medicine and other fields. One aspect of the first edition that I liked was its concision. Unfortunately, this has had to be sacrificed to the stampeding progress of the field.
PLAN OF THE SECOND EDITION
Chapter 1, Introduction to Genomics, sets the stage, and introduces all of the major players: DNA and protein sequences and structures, genomes and proteomes, databases and information retrieval, and bioinformatics and the World Wide Web. Subsequent chapters develop these topics in detail. Chapter 1 briefly provides the framework of how they fit together and sets them in their context of biomedical, physical, and computational sciences.
Chapter 2 demonstrates that Genomes are the Hub of Biology. Whereas genome sequences are determined from individuals, to appreciate life as a whole requires extending our point of view spatially, to populations and interacting populations; and temporally, to consider life as a phenomenon with a history. We can study the characteristics of life in the present, we can determine what came before, and we can – at least to some extent – extrapolate to the future. The ‘central dogma’ and the genetic code underlie the implementation of the genome, in terms of the synthesis of RNAs and proteins. Absent from Crick’s original statement of the central dogma is the crucial role of regulation in making cells stable and robust, two characteristics essential for survival.
Chapter 3, Mapping, Sequencing, Annotation, and Databases, describes how genomics has emerged from classical genetics and molecular biology. The first nucleic acid sequencing, by groups led by F. Sanger and W. Gilbert, in the 1970s, was a breakthrough comparable to the discovery of the double helix of DNA. The challenges of sequencing stimulated spectacular improvements in technology. The first was the automation of the Sanger method. The original sequences of the human genome were accomplished by batteries of automated Sanger sequencers. Subsequently, a series of ‘new generations’ of novel approaches have brought the landmark goal, the $US1000 human genome, within reach. Where do all the data go? Chapter 3 also introduces the databanks that archive, curate, and distribute the data, and some of the information-retrieval tools that make them accessible to scientific enquiry.
Chapter 4, Comparative Genomics, begins with a general survey of the different modes of genome organization with which living things have experimented. With complete sequences of genomes of many different species, we can confront and compare them. But, of course, different individuals of a species do not necessarily have identical genomes. What is the nature and extent of the variability? With intraspecies variability as a baseline, what are the similarities and differences between genomes of different species? Comparative genomics thereby allows us to address a question central not only to this book but to the field as a whole: what does it mean to be human?
Chapter 5, Evolution and Genomic Change, relates genomics to evolution, a major unifying principle of biology. (Arguably the laws of thermodynamics are another.) T. Dobzhansky famously said: ‘Nothing in biology makes sense except in the light of evolution.’ Description of some of the important ideas and tools – of taxonomy and phylogeny, on the classical species level and on the molecular level – will be useful in organizing the material in subsequent chapters.
Chapter 6, Genomes of Prokaryotes, surveys the genomes of bacteria and archaea in more detail. Taxonomy and phylogeny of prokaryotes present problems because of extensive horizontal gene transfer. This challenges the whole idea of a hierarchy of biological classification. Many bacteria have been cloned and studied in isolation, especially those responsible for disease. However, a new field, metagenomics, deals with the entire complement of living things in an environmental sample, allowing us to address questions about interspecies interaction in the ‘real world’. Sources include ocean water, soil, and the human gut.
Chapter 7 surveys Genomes of Eukaryotes. It starts with yeast, which is about as simple as a eukaryote can get. Selected plant, invertebrate, and chordate genomes illuminate the many profound common features of eukaryotic genomes; and the very great variety of structures, biochemistry, and lifestyles that are compatible with the underlying similarities.
Chapter 8, Genomics and Human Biology, develops applications to the study of our own species. Although clinical applications are undoubtedly the most important, genomics has important contributions to make to human palaeontology, anthropology, and the law. The ability to extract DNA from extinct species, including Neanderthals, sheds light on our early evolution. Events in our history, including migrations and domestication of crops and animals, have left their traces in DNA sequences.
Chapter 9 deals with Transcriptomics, the measurement and application of protein expression patterns. These measurements have been carried out using microarrays. However, as sequencing technology grows in power, it may replace microarrays as the method of choice for these measurements. Applications treated include changes in different physiological states, such as the diauxic shift in yeast, or sleep and waking in rats; in plant and animal development; and in diagnosis and treatment of disease.
Chapter 10, Proteomics, describes the principles of protein structure and the high-throughput data streams that provide information about sets of proteins in cells. Proteomics is an essential complement to genomics. The interactions and relationships between the genome and proteome are intimate, both during cellular activity and in the longer term in evolution. (A colleague once entitled a keynote lecture: ‘Genes are from Venus, proteins are from Mars.’)
The last chapter emphasizes attempts to integrate our data and understanding, an area known as systems biology. Whereas classical biochemistry made great contributions to demonstrating the properties of proteins in isolation, our job now is to put things back together. Chapter 11, Systems Biology, presents a description of biological organization that is based on networks. Cells contain parallel sets of networks based on physical and logical interactions among molecules. Each network also has static and dynamic aspects. The ultimate, most profound, goal is a complete and integrated picture of all of life’s activity, from the molecule to the biosphere.
NEW TO THIS EDITION
The most important change since publication of the first edition has been the spectacular progress in high-throughput sequencing. The resulting growth of the data produced – in quality, quantity, and type – has altered the entire landscape of genomics itself, and its influence has invaded surrounding fields. No area of biology has been left unscathed.
The new edition reflects this. Many more complete genomes are available. Instead of a few isolated snapshots of evolution, we can trace its pathways through the phyla. Instead of sequencing single individuals, it is possible to measure directly the variation within populations. Palaeogenomics has opened a window onto extinct species.
High-throughput sequencing has created novel data streams, such as RNAseq to measure the transcriptome. The new techniques complement, and may for some purposes even supersede, some common experimental techniques such as microarrays.
Study of human disease through sequencing continues to be a major effort. Identification of genes responsible for particular diseases permits testing, genetic counselling, and risk assessment and avoidance. For many important diseases, a patient can expect more precise diagnosis and prognosis, and more precise recommendations for treatment. Cancer genomics – the comparison of sequences from normal and tumour cells from single patients – has become a major activity.
In the new edition, extended coverage is given both to the applications of genome sequences to working out of evolutionary relationships in microorganisms, plants and animals; and to clinical applications to humans. Non-clinical applications to human biology and history are sufficient to justify a chapter of their own. The genomics of crop domestication not only sheds light on human – as well as plant – history, but emphasizes the reciprocal interactions between humans and the rest of the biosphere.
Nevertheless, progress is happening so fast as to make unavoidable the feeling of frustration in aiming at a moving target. The hope is that the second edition has erected for the reader a sound framework, both intellectual and factual, that will make it possible, when encountering subsequent developments, to see where and how they fit in.
RECOMMENDED READING
Where else might the interested reader turn? This book is designed as a companion volume to three others: Introduction to Protein Architecture: the Structural Biology of Proteins; Introduction to Protein Science: Architecture, Function, and Genomics; and Introduction to Bioinformatics (all published by Oxford University Press). Of course there are many fine books by many authors, some of which are listed as recommended reading at the ends of the chapters. The goal is that each reader will come to recognize his or her own interests, and be equipped to follow them up.
Many applications of genomics to health care are discussed in the book. However, nothing here should be taken as offering medical advice to anyone about any condition.
Results and research in genomics make use of the web, both for storage and distribution of data, and methods of analysis. Readers will need to become familiar with web sites in genomics, and to develop skills in using them. Many useful sites are mentioned in the book. The author’s Introduction to Bioinformatics offers a pedagogical approach to computational aspects of genomics. However, clearly the place to learn about the web is on the web itself.
To this end, an Online Resource Centre at www.oxfordtextbooks.co.uk/orc/leskgenomics2e/ accompanies this book. This contains material from the book – figures and ‘movies’ of the pictures of structures, answers to exercises, and hints for solving problems. In addition it contains a guided tour of web sites in genomics, coordinated with the printed book, with additional exercises and problems (of the ‘weblem’ variety). Some of these are suitable for use as practical or laboratory assignments.
ACKNOWLEDGEMENTS
I thank S. Ades, G.F. Anderson, M.M. Babu, S.L. Baldauf, P. Berman, B. de Bono, D.A. Bryant, C. Cirelli, A. Cornish-Bowden, N.V. Fedoroff, J.G. Ferry, R. Flegg, J.R. Fresco, D. Grove, R. Hardison, E. Holmes, H. Klein, E. Koc, A.S. Konagurthu, T. Kouzarides, H.A. Lawson, E.L. Lesk, M.E. Lesk, V.E. Lesk, V.I. Lesk, D.A. Lomas, B. Luisi, P. Maas, K. Makova, W.B. Miller, C. Mitchell, J. Moult, E. Nacheva, A. Nekrutenko, G. Otto, A. Pastore, D. Perry, C. Praul, K. Reed, G.D. Rose, J. Rossjohn, S. Schuster, B. Shapiro, J. Tamames, A. Tramontano, A.A. Travers, A. Valencia, G. Vriend, L. Waits, J.C. Whisstock, A.S. Wilkins and E.B. Ziff for helpful advice.
I thank the staff of Oxford University Press for their skills and patience in producing this book.
CONTENTS
1 Introduction to Genomics
The human genome
Phenotype = genotype + environment + life history + epigenetics
Introduction
Unity and diversity of life
Taxonomy based on sequences
Phylogeny
Phylogenetic trees
Clustering methods
Cladistic methods
The problem of varying rates of evolution
Bayesian methods
Archaea
The genome of Methanococcus jannaschii
Life at extreme temperatures
Comparative genomics of hyperthermophilic archaea: Thermococcus kodakarensis and Pyrococci
Bacteria
Genomes of pathogenic bacteria*
Genomics and the development of vaccines*
Introduction
Applications of DNA microarrays
10 Proteomics
Introduction
Protein nature and types
Protein structure
The chemical structure of proteins
Conformation of the polypeptide chain
Protein folding patterns
Epilogue
Index
Collect a sample of your cells by rinsing out your mouth with dilute salt water. Add
some detergent to dissolve the cell and nuclear membranes, releasing the contents.
Precipitate out the proteins by adding alcohol; stir, and let settle.
In a few minutes, fibres will rise out of the murk to the top of the vessel.
These fibres are your DNA (plus that of some bacteria that were living in your
mouth). The sequence of your DNA is your lifelong endowment. It determined
that you are human, that you are male or female, that you are brown- or blue-eyed,
that you are right- or left-handed. But think of these fragile cables not as a leash that
constrains you but as the cords that raise the curtains on the drama of your life.
Introduction to Genomics
LEARNING GOALS
• Knowing the basic facts about the human genome – how many base pairs it contains, estimates
of how many genes it contains that code for proteins or RNAs.
• Recognizing the contributions to any individual’s phenotype from the genome sequence itself,
from life history, and from epigenetic signals within the fertilized egg.
• Appreciating that the human genome contains extensive repetitive regions of various kinds.
• Knowing the basic central dogma, that DNA is transcribed to RNA, which is translated to
protein. Beyond this, knowing that many protein-coding genes show variable splicing, which
adds another dimension of complexity to the scheme by which the genome specifies the
proteome.
• Understanding the importance of comparative genome sequencing projects, to reveal processes
of evolution, and to help interpret regions in the human genome.
• Appreciating the large number of genome projects treating different species, widely distributed
among life forms, and including metagenomics, which produces very large amounts of data
from environmental samples.
• Understanding how different types of human DNA sequencing projects are organized, including
those carried out by large international organizations, the collection of data by law-enforcement
organizations, those carried out for specific clinical tests, and direct-to-the-public sequencing
usually motivated by questions of genealogy.
• Distinguishing the several types of potential applications of genome sequence data to medicine,
including ways in which information about large numbers of people can support clinical research,
and ways in which information about specific individuals can help treat disease more effectively,
or – even better – prevent it.
• Understanding the importance of computer science and bioinformatics in producing the raw
sequence data, in creating databases in molecular biology, in archiving and careful curation of
the data, in distributing them via the web, and in creating information-retrieval tools to allow
effective mining of the data for research and applications.
• Appreciating the ethical, legal, and social implications of collection of DNA sequence data and
the conflicting demands of public safety and individual privacy.
A human genome contains approximately 3.2 × 10⁹ base pairs, distributed among 22 paired chromosomes, plus two X chromosomes in females and X and Y in males. The first human genomes were determined in 2001, the culmination of 10 years of pioneering work and dedication. Since then, advances in technology have made genomic sequencing cheaper and faster. Sequence data now flow copiously. This creates the challenges of understanding the information that our genomes contain, and applying the data and analysis to improve human welfare. Sequencing genomes of other species both facilitates these goals and extends them, by revealing general principles of biology.
How do the contents of our genomes determine who we are?
Phenotype = genotype + environment + life history + epigenetics
Each reader of this book is an individual, with physical, biochemical, and psychological characteristics. (Do not be surprised if these distinctions become more and more nebulous!) Each of you has a general form and metabolism that is common to all humans. At the molecular level, you have much in common with other species as well. But there is also substantial variation within our species, to give you your individual appearance and character. You are in a state of health somewhere within the spectrum between robust good health and morbid disease. You are currently in some psychological state, and in some mood, reflecting your personality and current activities.
• Your genotype is your DNA sequence, both nuclear and mitochondrial. (For plants, include also the sequence of the chloroplast DNA.)
• Your phenotype is the collection of your observable traits, other than your DNA sequence. These include macroscopic properties such as height, weight, eye and hair colour; and microscopic ones, such as possible sickle-cell anaemia, glucose-6-phosphate dehydrogenase deficiency, or the retention beyond infancy of the ability to digest lactose.
Of great importance to clinical applications of genomics are traits that govern susceptibility to disease and risk factors; and those that determine the effectiveness of different drugs in different individuals. These allow for personalized prevention and treatment of disease based on DNA sequences, or pharmacogenomics.
• Your life history includes the integrated total of your experiences, and the physical and psychological environment in which you developed. Your nutritional history has influenced your physical development. A nurturing environment and educational opportunities have influenced your mental development. Less obvious than most aspects of your life history is the growing recognition of the importance of your in utero environment in determining your development curve and even your adult characteristics.
• At the interface between the genome and life experience are epigenetic factors. It is largely true that all cells of your body, except sperm or egg cells and cells of the immune system, have almost the same DNA sequence (subject to usually modest amounts of accumulated mutations). Yet your tissues are differentiated, with different sets of genes expressed or silenced in liver, brain, etc.
Now, some of these regulatory signals survive cell division. (When a liver cell divides, it divides into two liver cells.) Your parents’ own life histories might have altered the epigenetic patterns in their cells, and the fertilized egg from which you were subsequently formed contained some of these ‘pre-differentiation’ signals. In this way inheritance of acquired characteristics has re-entered respectable mainstream biology.
The relative importance of these factors in determining your phenotype varies from trait to trait. Some are determined solely and irrevocably by your alleles for specific genes. Others depend on complex interactions between your genes and your life history, and epigenetic signals from your parents.
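To give a feel for the scale of the figure quoted at the start of this section (approximately 3.2 × 10⁹ base pairs), the following is a rough back-of-envelope sketch rather than a description of how sequence databases actually store data. It assumes, purely for illustration, that each base is one of four letters and can therefore be encoded in 2 bits; the function name `genome_storage_megabytes` is hypothetical.

```python
# Back-of-envelope estimate of the raw information content of one genome.
# Assumption (not from the text): each base is encoded in 2 bits,
# because there are 4 possible bases and log2(4) = 2.
GENOME_BASE_PAIRS = 3_200_000_000  # approximate figure quoted in the text
BITS_PER_BASE = 2


def genome_storage_megabytes(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> float:
    """Return the size in megabytes (10**6 bytes) of a sequence stored at a fixed number of bits per base."""
    total_bits = base_pairs * bits_per_base
    total_bytes = total_bits / 8
    return total_bytes / 1_000_000


print(genome_storage_megabytes(GENOME_BASE_PAIRS))  # 800.0
```

Under these assumptions a single genome sequence occupies roughly 800 megabytes before any compression or annotation; real file formats, which usually carry quality scores and metadata, are considerably larger.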
The human genome
A walk through the human genome is like a tour of a continent. One encounters centres of bustling activity, rich in genes and their regulatory elements. These are like villages and even cities. One passes also through large tracts of emptiness, or regions with unrelieved monotony of repeated elements.
An inventory of the human genome includes:
• The most prominent and familiar aspects of the genome, the regions that code for proteins. Protein-coding genes are transcribed into messenger RNA (mRNA). After processing, ribosomes translate mature mRNA to polypeptide chains. Each triplet of nucleotides in the message corresponds to one amino acid, according to the genetic code (see Box 1.1). Box 1.2 lists the 20 canonical amino acids.
Despite their importance, protein-coding genes occupy a small fraction of the human genome – no more than about 2–3% of the overall sequence. They are distributed across the different chromosomes, but not evenly. Many protein-coding genes appear in multiple copies, either identical or diverged into families. For instance, humans have over 900 related olfactory-receptor genes, and some animals have many more.
• Francis Crick encapsulated this scheme in the Central Dogma of Molecular Biology:
DNA makes RNA makes Protein
• Some regions of the genome encode non-protein-coding RNA molecules (that is, RNAs exclusive of messenger RNAs), including but not limited to transfer RNAs, the RNA components of ribosomes, and microRNAs and small interfering RNAs that regulate translation (miRNAs and
Transcription of a protein-coding gene into RNA is followed The standard genetic code
in eukaryotes by splicing to form a mature messenger RNA ttt Phe F tct Ser S tat Tyr Y tgt Cys C
(mRNA) molecule. The ribosome synthesizes a polypeptide ttc Phe F tcc Ser S tac Tyr Y tgc Cys C
chain according to the sequence of triplets of nucleotides, tta Leu L tca Ser S taa STOP tga STOP
or codons, in the mRNA. The protein folds spontaneously ttg Leu L tcg Ser S tag STOP tgg Trp W
to a native three-dimensional structure that accounts for its
ctt Leu L cct Pro P cat His H cgt Arg R
biological function.
ctc Leu L ccc Pro P cac His H cgc Arg R
The standard genetic code is shown here. The codons
cta Leu L cca Pro P caa Gln Q cga Arg R
are those appearing in DNA rather than RNA; that is, the
ctg Leu L ccg Pro P cag Gln Q cgg Arg R
codons contain t rather than u. Both three-letter and one-
letter abbreviations for the amino acids appear. Note that att Ile I act Thr T aat Asn N agt Ser S
the code is redundant: except for Met and Trp, multiple atc Ile I acc Thr T aac Asn N agc Ser S
codons specify the same amino acid. A mutation that ata Ile I aca Thr T aaa Lys K aga Arg R
changes a codon to another codon for the same amino acid atg Met M acg Thr T aag Lys K agg Arg R
is called a synonymous mutation. Three triplets are reserved gtt Val V gct Ala A gat Asp D ggt Gly G
as STOP signals, effecting termination of translation. gtc Val V gcc Ala A gac Asp D ggc Gly G
Variations from this standard code occur in mitochondria gta Val V gca Ala A gaa Glu E gga Gly G
and chloroplasts, and sporadically in individual species. gtg Val V gcg Ala A gag Glu E ggg Gly G
Contents of the human genome 7
Non-polar amino acids Amino acid names are frequently abbreviated to their first
G glycine A alanine P proline V valine three letters – for instance Gly for glycine – except for iso-
I isoleucine L leucine F phenylalanine M methionine leucine, asparagine, glutamine, and tryptophan, which are
abbreviated to Ile, Asn, Gln, and Trp, respectively. The rare
Polar amino acids amino acid selenocysteine has the three-letter abbreviation
Sec and the one-letter code U. The even rarer amino acid
S serine C cysteine T threonine N asparagine
pyrrolysine has the three-letter abbreviation pyl and the
Q glutamine Y tyrosine W tryptophan
one-letter code O.
Amino acid sequences are always stated in order from
Charged amino acids
the N-terminal to the C-terminal. This is also the order in
D aspartic acid E glutamic acid K lysine R arginine which ribosomes synthesize proteins: ribosomes add amino
H histidine acids to the free carboxy terminus of the growing chain.
siRNAs). There are about 3000 genes coding for the regulatory sites themselves, and all the proteins
RNAs, exclusive of the mRNAs translated to pro- and RNAs encoded that have regulatory functions,
teins. It is becoming clear that the RNA-ome is arguably including receptors.
much richer than had been suspected. Except for • Repetitive elements of unknown function account
RNAs involved in the machinery of protein syn- for surprisingly large fractions of our genomes.
thesis, such as transfer RNAs and the ribosome Long and Short Interspersed Elements (LINES and
itself, most non-coding RNAs are involved in con- SINES) account for 21% and 13% of the genome.
trol of gene expression. Even more-highly repeated sequences – minisatel-
The regions that encode proteins and non-protein- lites and microsatellites – may appear as tens or
coding RNAs correspond to molecules that form even hundreds of thousands of copies, in aggregate
non-transient parts of the cell’s contents. They do of amounting to 15% of the genome. (See Box 1.3).
course ‘turn over’, but at rates lower than messenger
RNAs. mRNAs must have short lifetimes in order to • We know functions of some regions of the genome.
turn off transcription, as part of the processes that That we cannot assign functions to others may merely
regulate gene expression. reflect our ignorance. Some regions appear to have a
life of their own, hitchhiking and reproducing within
• Other regions contain binding sites for ligands
genomes, and contributing to evolution. In some
responsible for regulation of transcription. In
cases they actively enhance rates of chromosomal
assessing the total amount of the genome dedi-
rearrangements.
cated to control, one would need to include both
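The standard genetic code table in Box 1.1 can be applied mechanically. The following is a minimal Python sketch, not taken from the book, that builds the table from a compact 64-letter string, translates a DNA coding sequence up to the first STOP, and tests whether two codons are synonymous. The names CODON_TABLE, translate, and is_synonymous are illustrative choices, and the sketch assumes the input is already in the correct reading frame.

```python
# Bases in the order used by the table in Box 1.1 (DNA alphabet: t, not u).
BASES = "tcag"

# One-letter amino acid codes for ttt, ttc, tta, ttg, tct, ... ggg;
# '*' marks the three STOP codons (taa, tag, tga).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence, codon by codon, stopping at the first STOP."""
    dna = dna.lower()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def is_synonymous(codon_a, codon_b):
    """True if two different codons specify the same amino acid (or both are STOP)."""
    codon_a, codon_b = codon_a.lower(), codon_b.lower()
    return codon_a != codon_b and CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


if __name__ == "__main__":
    print(translate("atggcttga"))       # Met, Ala, then STOP
    print(is_synonymous("ctt", "ctc"))  # both Leu
    print(is_synonymous("gaa", "gat"))  # Glu versus Asp
```

Building the table from a string keeps the sketch short; a real analysis would more likely use an established library and would need an alternative table for mitochondrial or chloroplast genes, as the box notes.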
It is now believed that the human genome contains about 23 000 protein-coding genes. Some regions of the genome are relatively poor in protein-coding genes. These include the subtelomeric regions, on all chromosomes, and chromosomes 18 and X. In contrast, chromosomes 19 and 22 are relatively rich in protein-coding genes.
Most human protein-coding genes contain exons (expressed regions) interrupted by introns (regions spliced out of mRNA and not translated to protein). The average exon size is about 200 bp. It is primarily the variability in intron size that causes the large size differences among protein-coding genes: the gene for insulin is 1.7 kb long, the LDL receptor gene is 5.45 kb, and the dystrophin gene is 2400 kb.
Genes appear on both strands. In many cases, unrelated genes are fairly well separated. However, there are examples of genes that partially overlap; and cases of an entire gene appearing, on the complementary strand, within an intron of another gene.
A typical protein-coding gene locus contains the exons and introns, with splice-signal sites at the intron–exon junctions. Transcription of the gene may be under the control of cis-regulatory elements near the gene, either upstream or down. Other regulatory elements may appear elsewhere in the genome, even on different chromosomes.
Often a neighbourhood of a gene contains a set of closely linked related genes. This is because a common mechanism of evolution is gene duplication followed by divergence. It is often possible to follow evolution through a set of successive duplications. However, in some cases a set of identical copies of a gene can appear on different chromosomes. The gene for ubiquitin is an example.
Ideally, it would be possible, following determination of a genome sequence, to infer the corresponding proteome – that is, the amino-acid sequences of the proteins expressed. However, several mechanisms introduce additional variety into the genome–proteome relationship:
• In eukaryotes, a mechanism of generating variety from a single gene sequence is alternative splicing. Alternative splicing involves forming a mature messenger RNA from different choices of exons from a gene, but always in the order in which they appear in the genome. It is estimated that ∼95% of multi-exon protein-coding genes in the human genome produce splice variants. There are also some known cases in which multiple promoters lead to transcription of parts of the same region into different proteins. If the reading frames of the different transcripts are not in phase, there will be no relationship between the protein sequences.
Genes that encode the proteome 9
• In both prokaryotes and eukaryotes, RNA editing can produce one or more proteins for which the amino-acid sequence may differ from that predicted from the genome sequence. For instance, in the wine grape (Vitis vinifera), the mRNAs arising from every mitochondrial protein-coding gene are subject to multiple C→U editing events, most of which alter the encoded amino acid. In humans, nuclear protein-coding genes are subject to editing that changes adenine to inosine (inosine has the coding properties of guanine). The editing, and hence the final amino-acid sequence, can be tissue-specific.
Variable splicing and RNA editing describe the relationship between genome sequences and proteins potentially encoded in them. Of course, a large proportion of cellular activity is dedicated to the regulation of gene expression – the selection of which potentially encoded proteins are expressed, and in what amounts.
The immune system stands outside the general assertions about the protein-coding regions of the human genome. The number of antibodies that we produce dwarfs all the other proteins – it is estimated that each human synthesizes 10⁸–10¹⁰ antibodies. The generation of such high diversity arises by special combinatorial splicing at the DNA, not the RNA, level.
Some regions of the genome contain pseudogenes. Pseudogenes are degenerate genes that have mutated so far from their original sequences that the polypeptide sequence they encode will not be functional. In some cases, pseudogenes have been picked up by viruses from mRNA, and reverse transcribed. This is recognizable from the fact that the introns have been lost.

• The human genome, and other eukaryotic genomes, contain genes that code for proteins and non-coding RNAs (that is, other than messenger RNAs), control regions, pseudogenes (non-functional sequences derived from genes by degeneration), and a wide variety of repetitive sequences. Genes that encode proteins in eukaryotes contain exons – regions that can be translated – and introns – regions that are spliced out before translation. The possibility of omitting one or more exons, called variable splicing, adds complexity to the relationship between base sequences of genes and amino acid sequences of proteins. In many cases, RNA editing creates additional differences between the DNA sequences in the genome and the amino acid sequences of the proteins.

The leap from the one-dimensional world of sequences to the three-dimensional world we inhabit
Gene sequences, mRNA sequences and amino acid sequences are all, from the logical point of view, one dimensional. To perform their proper catalytic, regulatory, or structural activities, proteins must adopt precise three-dimensional structures. The miracle is that the structure is inherent in the amino acid sequence. (See Figure 1.1.) For each natural amino acid sequence, there is a unique stable native state that under proper conditions is spontaneously taken up.
The evidence is from the reversible denaturation of proteins: a native protein that is heated, or otherwise brought to conditions far from its normal physiological environment, will denature to a disordered, biologically inactive state. When normal conditions are restored, proteins renature, readopting the native structure, indistinguishable in structure and function from the original state. No information is available to the denatured protein, to direct its renaturation, other than the amino acid sequence. (See Box 1.4.)
We therefore have the paradigm:
• DNA sequence determines protein sequence;
• protein sequence determines protein structure;
• protein structure determines protein function.
Because amino acid sequence determines protein structure, we should be able to write computer programs to predict protein structures. This would be useful, because we know many more amino acid sequences than experimentally determined three-dimensional structures of proteins. Structure prediction methods have recently improved substantially (see Chapter 10). Reliable predictions will allow the creation of a library of the structures of the proteins encoded in any genome.
The principle that amino acid sequence dictates protein structure has been the most fundamental principle of structural molecular biology. Imagine, therefore, the shock produced by the observation of an effect of a synonymous mutation on a protein structure. In humans, the Multidrug Resistance 1 (MDR1) gene encodes a membrane pump, P-glycoprotein. In 2007, Kimchi-Sarfaty et al. observed that a synonymous mutation in MDR1 produces a product with altered affinity for ligands. The hypothesis is that protein
10 1 Introduction to Genomics
Figure 1.1 A most ingenious paradox: the translation of DNA sequences to amino acid sequences is very simple to describe logically; it is specified by the genetic code. The folding of the polypeptide chain into a precise three-dimensional structure is very difficult to describe logically. However, translation requires the immensely complicated machinery of the ribosome, tRNAs, and associated molecules, but protein folding occurs spontaneously. (Label in figure: Genetic code ‘translation table’.)
Chromosomes, organelles, and plasmids
The biosphere as we know it includes living things based on cells and also viruses. The most general classification of cells, according to both their structure and molecular biology, divides prokaryotes, simple cells without a nucleus, from eukaryotes, cells with nuclei.
From the point of view of genomics, the most relevant difference between prokaryotic and eukaryotic cells is the form and organization of the genetic material.
Varieties of genome organization 11
Property                                       Prokaryotic cells                                  Eukaryotic cells
Size                                           10 µm                                              ∼0.1 mm
Subcellular division                           No nucleus                                         Nucleus
State of major component of genetic material   Circular loop, few proteins permanently attached   Complexed with histones to form chromosomes
Internal differentiation                       No organized subcellular structure                 Nuclei, mitochondria, chloroplasts, cytoskeleton, endoplasmic reticulum, Golgi apparatus
Cell division                                  Fission                                            Mitosis (or meiosis)
In structuring their DNA, cells have two problems to solve. The first is a packaging challenge. The DNA of E. coli is 1.6 mm long but must fit into a cell 2 µm long and 0.8 µm wide (1 µm = 0.001 mm). Eukaryotic cells have a harder version of the problem: the nucleus of a human cell has a diameter of 6 µm, but the total length of the DNA is 1 m. How can the DNA be deployed in cells in a form that is compact but accessible to proteins active in replication and transcription? The second problem is what to do during cell division, after DNA replication, to ensure that each daughter cell gets one copy of the DNA. Different types of cells, and viruses, solve these problems in different ways.
In the typical prokaryotic cell, most of the DNA has the form of a single closed, or circular, molecule. It is complexed with proteins to form a structure called a nucleoid. The DNA is attached to the inside of the plasma membrane but is accessible to molecules in the cytoplasm. Some bacteria have multiple circular DNA molecules; others have linear DNA. In addition, prokaryotic cells can contain plasmids, small pieces of circular DNA, neither complexed permanently with protein nor attached to the membrane. The development and spread of antibiotic resistance in pathogenic bacteria, an increasingly serious public health problem, are often associated with exchange of plasmids among strains. Bacterial plasmids are also used as vectors for genetic engineering.
In a eukaryotic cell, most of the DNA is sequestered in the nucleus. The nucleus is the site of DNA replication and RNA synthesis in gene transcription. Nuclear DNA is complexed with histones and other proteins to form chromosomes, large nucleoprotein complexes. Each chromosome contains a single linear molecule of DNA. The nuclei of different species contain different numbers of chromosomes and in each species the chromosomes vary in length. Humans contain 46 chromosomes – 22 pairs, plus two X chromosomes in females or one X and one Y chromosome in males. Deviations from the normal complement of chromosomes have clinical consequences; for example, the presence of three copies of chromosome 21 (trisomy 21) is associated with Down’s syndrome.
The state – notably the accessibility to transcriptional machinery – of different regions of DNA in eukaryotic chromosomes is modulated by the local structure of the chromosome, notably the interaction with histones (Figure 1.2).
Subcellular organelles, including mitochondria and chloroplasts, are believed to have originated as intracellular parasites (see Box 1.4). These organelles contain additional DNA in the form of single closed or circular molecules, uncomplexed with histones, like the DNA of prokaryotes. Mitochondria and chloroplasts also contain their own protein-synthesizing machinery, using a slightly different dialect of the nearly universal genetic code.
Eukaryotic cells may also contain plasmids. Yeast artificial chromosomes (YACs) – plasmids in yeast cells – are of great utility in genome sequencing projects.
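The scale of the packaging problem can be made concrete with a little arithmetic. The sketch below, in Python, uses the lengths quoted in the text (E. coli DNA 1.6 mm in a cell about 2 µm long; human DNA about 1 m in a nucleus about 6 µm across). The rise of 0.34 nm per base pair and the E. coli genome size of about 4.6 million base pairs are assumptions brought in from general knowledge, not from this chapter, and the function names are illustrative.

```python
# Lengths quoted in the text, converted to metres.
ECOLI_DNA_M = 1.6e-3       # 1.6 mm of DNA
ECOLI_CELL_M = 2e-6        # cell about 2 micrometres long
HUMAN_DNA_M = 1.0          # about 1 m of DNA per nucleus
HUMAN_NUCLEUS_M = 6e-6     # nucleus about 6 micrometres in diameter

# Assumptions not stated in the chapter: rise per base pair of B-form DNA
# and the approximate size of the E. coli genome.
RISE_PER_BP_M = 0.34e-9
ECOLI_GENOME_BP = 4.6e6


def linear_compaction(dna_length_m, container_length_m):
    """How many times longer the DNA is than the compartment that holds it."""
    return dna_length_m / container_length_m


def contour_length_m(base_pairs, rise_per_bp_m=RISE_PER_BP_M):
    """Approximate end-to-end length of a fully extended double helix."""
    return base_pairs * rise_per_bp_m


if __name__ == "__main__":
    print(round(linear_compaction(ECOLI_DNA_M, ECOLI_CELL_M)))      # E. coli
    print(round(linear_compaction(HUMAN_DNA_M, HUMAN_NUCLEUS_M)))   # human nucleus
    print(contour_length_m(ECOLI_GENOME_BP) * 1e3, "mm")            # compare with 1.6 mm
```

On these figures the E. coli chromosome is roughly 800 times longer than the cell, and human nuclear DNA is longer than the nuclear diameter by a factor of over 100 000, which is why eukaryotic chromatin needs several levels of folding.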
[Figure panel (b) labels: “beads-on-a-string” created by formation of nucleosomes; 30 nm solenoid]
Mitochondria and chloroplasts are subcellular particles involved in energy transduction (Figure 1.3). Mitochondria carry out oxidative phosphorylation, the conversion into ATP of reducing power derived from metabolizing food. Chloroplasts carry out photosynthesis, the capture of light energy in the form of ‘reducing power’ – NADPH – and ATP.
Both types of particle lead a quasi-independent life within the cell. They are surrounded by membranes, they have their own genetic material and protein-synthesizing machinery, and they reproduce themselves within cells independently of cell division.
To perform their functions, it is essential for mitochondria and chloroplasts to be enclosed. This is because electrochemical and photochemical energy conversion require establishment of a pH gradient across the organelle membrane. The passage of protons through the membrane is coupled to generation first of mechanical energy and then of chemical bond energy through the action of the molecular motor ATP synthase: chemiosmotic energy stored in pH gradient → mechanical energy in ATP synthase → chemical bond energy of ATP.
Where might a cell look to find a small, self-enclosed, and largely self-sufficient object to serve as an organelle? Why, at another cell! There is a consensus that mitochondria and chloroplasts originated as prokaryotic endosymbionts that originally lived independently but took up residence inside other cells. Evidence for this includes: (1) the state of the DNA: organelle DNA is circular and uncomplexed with
Table 1.2 Transposable elements in the human genome
Element        Estimated number    % of total genome
SINE + LINE    2.4 × 10⁶           33.9
LTR            0.3 × 10⁶           8.3
Transposons    0.3 × 10⁶           2.8
Total          3.0 × 10⁶           ∼45
Data from: Bannert, N. & Kurth, R. (2004). Retroelements and the human genome: New perspectives on an old relation. Proc. Natl. Acad. Sci. USA 101, 14572–14579.

• All coding regions have non-random sequence characteristics, based partly on codon usage preferences. Empirically, it is found that statistics of hexanucleotides perform best in distinguishing coding from non-coding regions. Using a set of known genes from an organism as a training set, pattern-recognition programs can be tuned to particular genomes.

• An early step in analysing a newly sequenced genome is to find the genes that code for proteins and RNAs, and to try to identify them.
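As a rough illustration of how hexanucleotide statistics could separate coding from non-coding sequence, the Python sketch below trains log-odds scores for every hexamer from two sets of example sequences and then averages those scores over a query. This is a simplified stand-in for a real gene finder: it counts all overlapping hexamers rather than tracking reading frame, the training sequences in the usage example are invented toy strings, and the function names are illustrative.

```python
import math
from collections import Counter


def hexamer_counts(sequences):
    """Count every overlapping 6-base window in a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.lower()
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    return counts


def train(coding, noncoding, pseudocount=1.0):
    """Return a function giving log(P(hexamer | coding) / P(hexamer | non-coding)).

    A pseudocount is added to each of the 4**6 possible hexamers so that
    hexamers unseen in training do not produce a division by zero.
    """
    coding_counts = hexamer_counts(coding)
    noncoding_counts = hexamer_counts(noncoding)
    coding_total = sum(coding_counts.values()) + pseudocount * 4 ** 6
    noncoding_total = sum(noncoding_counts.values()) + pseudocount * 4 ** 6

    def log_odds(hexamer):
        p_coding = (coding_counts[hexamer] + pseudocount) / coding_total
        p_noncoding = (noncoding_counts[hexamer] + pseudocount) / noncoding_total
        return math.log(p_coding / p_noncoding)

    return log_odds


def score(sequence, log_odds):
    """Mean log-odds over all hexamers; positive values favour 'coding'."""
    sequence = sequence.lower()
    n_windows = len(sequence) - 5
    if n_windows <= 0:
        return 0.0
    return sum(log_odds(sequence[i:i + 6]) for i in range(n_windows)) / n_windows


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    model = train(coding=["atggctgctgctgctgctgct"], noncoding=["tatatatatatatatatata"])
    print(score("gctgctgctgct", model))
    print(score("atatatatatat", model))
```

Training on known genes from the organism of interest, as the text describes, corresponds here to choosing the `coding` and `noncoding` training sets from that genome.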
Dynamic components of genomes
Transposable elements are skittish segments of DNA, found in all organisms, that move around the genome. They were discovered by B. McClintock in the 1940s in studies of maize. Transposable elements in Indian maize (or corn) create a genetic mosaic, giving the ears a mottled appearance (see Figure 1.5). In this case, transposition is fast enough to affect an individual organism. Other transposable elements move more slowly, on evolutionary timescales.
The focus of this section is transposable elements within nuclear genes of eukaryotes. We shall discuss the traffic of genes between organelle (mitochondria and chloroplasts) and nuclear genomes in Chapter 4.
Figure 1.5 The ears of Indian maize are mosaics. The dark pigments are anthocyanins. Yellow sectors arise when a jumping element or transposon interferes with expression or function of the genes for biosynthesis of anthocyanins, during development of individual kernels. (Photograph courtesy of L.D. Graham, Rockingham, VA, USA)
Different types of element show alternative mechanisms of transposition.
Retrotransposons (class I) replicate via an RNA intermediate. Many, if not all, of them are degenerate retroviruses.
Transposons (class II) produce DNA copies without an intermediate RNA stage. They encode an enzyme called transposase, which recognizes sequences within the transposon itself, cuts it out, and inserts it elsewhere. Often the excision is sloppy, leaving a mutation at the original site. Sometimes, a bit of the surrounding sequence adheres to and accompanies the transposed material.
Because transposable elements can replicate, they are related to some of the types of repetitive sequence found in genomes (see Table 1.2). Mammalian genomes contain retrotransposons (RNA-mediated replication) called long and short interspersed elements (LINEs and SINEs). LINEs are typically 1–5 kb long, with tens to tens of thousands of copies. The most common LINE, L1, appears ∼20 000 times in the genome. SINEs are typically 200–300 bp long, with hundreds of thousands of copies in the genome, at scattered locations. The human genome contains about 300 000 copies of the most common SINE, the Alu element, which is 280 bp long. The total amount of L1 + Alu is 7% of the human genome. LINEs encode a reverse transcriptase and can replicate autonomously. SINEs are too short to encode their own reverse transcriptase. SINEs depend on LINEs or other sources of the required activities for replication.
Transposons contain inverted repeats at their ends, which are the targets of the excision machinery (see Figure 1.6 and Box 1.6). Replication may occur in ‘cut-and-paste’ mode, moving the transposon from
Figure 1.6 Fragment of a chromosome containing a transposon (red). The ends of the transposon contain inverted repeat sequences,
demarcating the region for excision. Within the transposon is a gene for the transposase enzyme, giving the region the capacity for
autonomous replication.
one site to another. Alternatively, in ‘copy-and-paste’ mode, replication leaves the original copy behind, while creating another.
If two equivalent transposons are nearby, they can move a whole segment including all the material between them. Transfer of multiple genes to a plasmid is a common mechanism of generation of antibiotic resistance in bacteria. (The Tn3 transposon illustrated in Box 1.6 contains only one set of terminal repeats.)
Biological effects of transposable elements include:
• Sequence broadcasting. Multiple copies of elements of a sequence may be distributed to various locations in the genome.
• Altering properties of genes. The arrival of a fragment of sequence within or in the vicinity of a gene may, if inserted into a coding region, render the gene product non-functional, creating a ‘knock-out’ effect. An inserted segment near a gene may affect its regulation or alter its splicing pattern. (Approximately 20% of human genes have transposable elements in flanking non-coding sequences.) Even an insertion in an intron can affect rates of transcription by slowing down the polymerase as it passes through.
• Transposable elements as an important engine of evolution. They provide a mechanism for gene evolution by gene fusion or exon shuffling. Transposable element insertion can cause species-specific alternative splicing patterns. This can produce new protein isoforms. It can also lead to disease; for example, the cause of a case of ornithine aminotransferase deficiency was a single base change that activated a cryptic 5′ splice site in an Alu element, introducing an in-frame stop codon leading to a truncated protein.
• Causing chromosomal rearrangements. This can include inversions, translocations, transpositions, and duplications, perhaps through mispairing of chromosomes during cell division. The deletions leading to Prader–Willi and Angelman syndromes (see Chapter 3) are associated with a mutation in the sequence of a nearby transposable element.
• Leakage of epigenetic modification. From the landlord’s point of view, transposable elements are squatters. For example, they make up 70% of the maize genome. In one sense, it is bad enough that they clutter up the DNA, but, even worse, eukaryotes must defend themselves against the expression of transposable elements. The tactics of defence are to methylate transposable elements, or to use small interfering RNAs (siRNAs) (see pp. 6–7). Some cancers and other diseases that lead to hypomethylation of DNA can cause transcriptional reactivation of some transposable elements. Methylation also cuts down on the mobility of those transposable elements that require transcription for mobility. However, mechanisms of silencing transposable elements can also affect neighbouring genes.

BOX 1.6 The 5′ and 3′ ends of transposons contain inverted repeats
The beginning and end of the sequence of the Tn3 transposon of E. coli contain a terminal repeat. This is a plasmid encoding β-lactamase, conferring ampicillin resistance, as well as a transposase.*
   1 GGGGTCTGAC GCTCAGTGGA ACGAAAACTC ACGTTAAGCA ACGTTTTCTG CCTCTGACGC
  61 CTCTTTTAAT GGTCTCAGAT GACCTTTGGT CACCAGTTCT GCCAGCGTGA AGGAATAATG
 ...
4861 TTTTTAATTT AAAAGGATCT AGGTGAAGAT CCTTTTTGAT AATCTCATGA CCAAAATCCC
4921 TTAACGTGAG TTTTCGTTCC ACTGAGCGTA AGACCCC
* Heffron, F., McCarthy, B.J., Ohtsubo, H. & Ohtsubo, E. (1979). DNA sequence analysis of the transposon Tn3: three genes and three sites involved in transposition of Tn3. Cell 18, 1153–1163.
Genome sequencing projects 17
• Transposable elements add a dynamic component to genetic change, at a higher level than point mutations. These
changes may have significant biological effects.
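The inverted repeats at the ends of the Tn3 sequence in Box 1.6 can be checked directly: the reverse complement of the final bases should read like the first bases. The Python sketch below does this for the first 40 and last 37 bases as printed in the box; the function names are illustrative, and the comparison is a plain position-by-position match rather than an alignment.

```python
# Translation table for complementing DNA bases (upper case, as in Box 1.6).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Bases 1-40 and the final 37 bases (from position 4921) as printed in Box 1.6.
TN3_START = "GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGCA"
TN3_END = "TTAACGTGAGTTTTCGTTCCACTGAGCGTAAGACCCC"


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def inverted_repeat_mismatches(start, end):
    """1-based positions where `start` differs from the reverse complement of `end`.

    Only the overlapping length of the two sequences is compared.
    """
    rc_end = reverse_complement(end)
    n = min(len(start), len(rc_end))
    return [i + 1 for i in range(n) if start[i] != rc_end[i]]


if __name__ == "__main__":
    print(reverse_complement(TN3_END))
    print(inverted_repeat_mismatches(TN3_START, TN3_END))
```

With the sequence as printed in the box, the two ends agree at 36 of the 37 compared positions, the single difference being at position 8. Whether that difference is real or a transcription error cannot be decided from the box alone.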
The static contents of the human genome, and its dynamic aspects, are similar in general features to what other genomes contain. As genome sequencing techniques become easier, the field is progressing in two directions: (a) determining more and more human genome sequences, especially those that may prove useful in research into disease anticipation and prevention, and (b) sequencing the genomes of many different species, for at least one individual each.
The National Center for Biotechnology Information database currently reports:

Table 1.3 Genome sequencing projects. Figures refer to numbers of species and strains, not numbers of individualsᵃ
Organism type          Number of genomes completed    Number of genomes in progress
Viruses and viroidsᵇ   3889
Archaea                113                            91
Bacteria               1588                           4914
Eukaryotes             36                             1175
ᵃ Source: http://www.ncbi.nlm.nih.gov/genome
ᵇ A viroid is a small single-stranded RNA which can replicate autonomously, but does not encode protein, nor is encapsulated within a coat.

There are many reasons for sequencing non-human genomes. The most important ones are that they reveal the processes of evolution, and that they help us to understand the functions of different regions of the human genome.
Other genomes are essential to illuminate ours. An important principle is that if evolution conserves something, it is essential. If evolution does not conserve something, it is not essential. In trying to understand the function of the ∼98% of the human genome that does not recognizably code for proteins or non-coding RNAs, we gain important clues by comparing the human genome with other mammalian genomes. Regions that are conserved must be conserved for a reason. We can focus on these regions to try to determine their function. Regions that are not conserved can be reserved for later.

He who does not know foreign languages does not know anything about his own – Goethe, Kunst und Alterthum

Sequences of genomes of other species also have direct application to human welfare. The genomes of pathogens that have developed antibiotic resistance, or are threatening to, give clues that we can use to try to keep ahead of them. Other practical applications include improving crops and domesticated animals.
Genomics can also support conservation efforts aimed at preserving endangered species.
Genomics, allied with anthropology and archaeology, helps recount the history of the human species. It can reveal patterns of migration. It can trace the domestication of plants and animals. These applications to many aspects of human biology are the subject of Chapter 8.
Many genome projects target individual species. In addition, a major component of public DNA sequence data repositories comes from metagenomic data. These are sequences determined from environmental samples, without isolating individual organisms. Sources include ocean water, soil samples, and the human gut.

• Some readers will be surprised to learn that within the volume of their bodies, there are more prokaryotic cells than human ones.

Nevertheless, the focus of genome sequencing efforts has been human subjects. Clinical applications have created a very large amount of sequence data for individual human genes. Many people have undergone genetic testing: for example, many prospective parents determine whether they are carriers of cystic fibrosis. Many women test for potentially dangerous mutations in genes that can predispose an
Any two people – except for identical siblings – have genomic sequences that differ at approximately 0.1% of the positions. Measurements of multiple human genomes permit distinction between random components of this variation, and those that systematically characterize different populations. Much but not all of the variation takes the form of isolated base substitutions, or single-nucleotide polymorphisms.
Sequence variation in humans has applications in anthropology, to trace migration patterns, and in personal identification to prove paternity or for crime investigation. Sequence variability in other species gives clues to the history of the species, including but not limited to understanding the history of domestication of animals and crop plants.
Cancer genome sequencing
Healthy cells do accumulate mutations at a modest rate. Cancer cells that have lost checks on accuracy of DNA replication accumulate mutations copiously. To distinguish variations arising from the disease, it is preferable to compare the sequences from tumour cells with those from normal cells from the same individual, rather than with a single reference genome.

• There are many different applications of high-throughput sequencing techniques, designed to address different types of questions.
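The reason for comparing tumour cells with normal cells from the same individual can be shown with a toy example. In the Python sketch below, a difference from the reference that appears in both the normal and the tumour sample is treated as inherited variation, and only differences specific to the tumour are reported. The sequences are invented, only single-base substitutions in pre-aligned, equal-length sequences are handled, and the function names are illustrative; real pipelines work from sequencing reads and must also deal with insertions, deletions, and sequencing error.

```python
def substitutions(sample, reference):
    """Map 1-based position -> (reference base, sample base) for every mismatch."""
    if len(sample) != len(reference):
        raise ValueError("sequences must be aligned and of equal length")
    return {
        i + 1: (ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    }


def tumour_specific(tumour, normal, reference):
    """Substitutions seen in the tumour but not in the matched normal sample."""
    tumour_calls = substitutions(tumour, reference)
    normal_calls = substitutions(normal, reference)
    return {
        position: call
        for position, call in tumour_calls.items()
        if normal_calls.get(position) != call
    }


if __name__ == "__main__":
    # Invented toy sequences: position 7 differs from the reference in both
    # samples (inherited); position 4 differs only in the tumour.
    reference = "ACGTACGTAC"
    normal = "ACGTACCTAC"
    tumour = "ACGAACCTAC"
    print(substitutions(tumour, reference))
    print(tumour_specific(tumour, normal, reference))
```

Comparing the tumour with the reference alone would report both positions; subtracting the matched normal leaves only the change that arose in the tumour.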
In October 2010, Nature published the estimate that over 2700 full human genomes will have been sequenced by the end of that month, and suggested that by the end of 2011 the total will be over 30 000. A few of the individuals are known. These include J. Craig Venter, the subject of the original Celera Corporation genome sequencing project, James D. Watson, Bishop Desmond Tutu, and Stanford University professor Stephen Quake. (Perhaps analysis might reveal a genetic locus associated with the Nobel Prize.) Actress Glenn Close had her genome sequenced, motivated by a family history that included several individuals who suffered from mental illness. The academic human DNA sequencing project used DNA primarily from a donor from Buffalo, New York. His identity has not been revealed.
Venter has also sequenced the genome of his poodle, Shadow. They are therefore the first human–pet combination with both sequences known.
Sequencing is now done under a variety of auspices:
• A number of international organizations with ambitious specific targets. These include the International HapMap Project http://hapmap.ncbi.nlm.nih.gov/ that focuses on the variations in sequences in populations distributed around the world. They are collecting in particular an atlas of single-nucleotide polymorphisms (SNPs) which are substitutions of individual bases. Clusters of SNPs that appear to be inherited in tandem are called haplotypes. (See Chapter 2.)
• The 1000-genome project is an extension of the HapMap project towards complete genome data, with an emphasis on discovering the conditions required to ensure appropriate data quality in projects of this type (http://www.1000genomes.org/). Specific goals include careful sequencing of family groups (mother + father + child), and detailed sequencing of 1000 protein-coding regions in 1000 individuals. Although one may wonder to what extent the stated goals will be overtaken by the growing ‘background’ of genome sequencing, the importance of the commitment of the international organizations to data quality control, curation, and free distribution should not be underestimated.
• Several companies now offer personal genome sequencing. Many provide sequencing of mitochondrial DNA or individual loci in nuclear DNA, for tracing of ancestry. The application of DNA to demonstration of legal relationships – most commonly, paternity testing – is well established.
Up to now, the cost of a full-genome sequence has been prohibitive for most people if the motivation is casual curiosity. However, it is already true that the cost of sequencing a person’s DNA is comparable to the cost of a night’s stay in hospital in the USA. As costs fall, and equipment becomes smaller, and we can derive more and more clinically useful information from sequence data, the conclusion seems inescapable that anyone entering a hospital will have at least a partial DNA sequence determination along with taking his or her pulse rate, temperature, and blood pressure. (See Problem 1.7.)
• It has taken 30 years for DNA sequencing to make the transition from Nobel Prize breakthrough research to secondary school science projects.
A teacher in a secondary school in New Jersey, USA, organized a project to analyse DNA samples of the students in her class. This was not full-genome sequencing but produced limited, genealogy-oriented data. The students compared the results with their own cultural backgrounds.
The following example is not human DNA sequencing, but usefully hints at the potential variety of applications, and the relative ease with which they can be carried out. In 2008, two teen-age students checked samples of fish from New York City sushi bars, using a genetic-fingerprinting technique called DNA Bar Coding (see p. 117). They discovered that, of the samples they could identify, 25% were mislabelled, in fact originating in less-expensive species than advertised. They identified some restaurants that were free of mislabelling.
The human genome and medicine
The development and delivery of health care challenge biomedical science, engineering, industry, and government – both separately and in their coordination. Not all components of these challenges are within the scope of this book. What is relevant is the increasing scientific sophistication of medical treatment and the shrinking of the distance and time between laboratory discovery and clinical practice: walls between ‘pure’ and ‘applied’ science have come tumblin’ down.
Of course, immense improvements in human health could be achieved by quite low-tech measures. Obvious examples include educating people about the long-term dangers of obesity, smoking, and alcohol abuse; use of seatbelts in automobiles and aeroplanes; and the provision, to the many people who lack them, of basic medical care and hygienic living conditions – for instance, clean water. In all communities, preserving a mutually nurturing relationship with nature will improve people’s mental and spiritual health, which – although in ways not entirely demonstrable yet in the laboratory – has important effects on physical well-being.
Nevertheless, genomics and proteomics have played a central role in the recent transformation of medicine and surgery, making contributions to prevention of disease, to detection and precise diagnosis, and to effective treatment.
Prevention of disease
• Vaccinations are pre-emptive strikes against infectious diseases. They prime the immune system to recognize pathogens. Some vaccines are (almost)
surviving grandparents would be leading lives of greatly reduced quality without regular treatment with drugs. The answers are eloquent. They engender fear of the new antibiotic-resistant strains of infectious microorganisms.
The traditional drug development process involved identifying a target – usually a protein – either from host or pathogen, the behaviour of which it is desired to affect. A drug is a molecule that intervenes in a living process by interacting with the target.
Recent scientific advances have accelerated the drug development process. Identifying metabolic features unique to a pathogen helps to identify targets for antibacterial and antiviral agents. Human proteins provide other drug targets to deal with molecular dysfunction or to adjust regulatory controls. Knowing the structure of a target permits computer-assisted drug design by molecular modelling.
Health care delivery
Many countries in which advanced treatments are available face severe economic impediments to the equitable delivery of medical and surgical care to their citizens. Research creates novel treatments, but many of them are extremely costly. In the USA, treatment for serious disease is already too expensive to be paid for out of most people’s earnings, health insurance has not been universal and ‘safety nets’ to protect the poor and elderly have been cut back. Baby-boomers, born in the late 1940s, are now approaching elderly status. All of these factors will put further pressure on the system. The Health Care and Education Reconciliation Act, which became US law in March 2010, should ameliorate the situation.
Considerations of social and economic policy are outside the scope of this book. However, science can make a contribution by improving the efficiency of use of the resources available.
For example, many drugs vary in their effectiveness in different patients. A drug may be effective for some patients and useless for others. Some patients may tolerate a treatment easily; others may suffer side effects ranging from discomfort through disability to death.
Analysis of patients’ genes and proteins permits selection of drugs and dosages optimal for individual patients, a field called pharmacogenomics. Physicians can thereby avoid experimenting with different therapies, a procedure that is dangerous in terms of side effects – sometimes even fatal – and in any case is wasteful and expensive. Treatment of patients for adverse reactions to prescribed drugs consumes billions of dollars in health care costs. Conversely, being able to predict individual patients’ responses can make it possible to rescue drugs that are safe and effective in a minority of patients, but which have been rejected before or during clinical trials because of inefficacy or severe side effects in the majority of patients.
Some specific examples:
• Acute lymphoblastic leukaemia is a childhood cancer treated by thiopurines. In the patient, the enzyme thiopurine methyltransferase breaks down the drug. A genetic variant producing an inactive enzyme threatens build-up of toxic levels of the drug in patients. Screening for the deficiency allows monitoring to determine appropriate dosages.
• Abacavir is a drug used in treatment of AIDS. 4–8% of patients show a serious, potentially fatal hypersensitivity reaction. This is correlated with MHC allele HLA-B*5701. Genomic screening can thereby detect potential hypersensitivity, and guide treatment.
• Cytochromes P450 are a family of enzymes in the liver responsible for metabolizing a wide variety of drugs. Sequence variations affect the activities of these enzymes, to the point where lowered activity, or loss of activity, can cause drug toxicity. Genetic tests for variations in cytochrome P450 genes warn of potential overdose dangers. When J.D. Watson’s genome sequence was determined, it emerged that he is homozygous for an unusual allele of the drug-metabolizing cytochrome gene CYP2D6. Individuals with this genotype metabolize some drugs relatively slowly. Watson had been taking β-blockers to reduce his blood pressure; however, the treatment made him unacceptably sleepy. Based on information from his genome he is now taking a lower dose.
Our ability to rationalize and anticipate individual differences in responses to drugs can improve the rate of clinical success. In improving the application of resources by avoiding treatments that are useless or even harmful, science can have a direct effect on the economics of health care delivery.
• Applications of genome sequencing to medicine already include: detection of diseases irrevocably implied by gene
sequences, such as Huntington’s disease, detection of diseases which an individual has an enhanced risk of developing
but for which the risks can be lessened by changes in lifestyle or prophylactic surgery, genetic counselling of prospective
parents, and the identification of optimal therapy depending on detailed diagnosis of the variety of the disease, and on
the expected reaction of the patient to different drugs.
The evolution and development of databases
High-throughput sequencing methods are generating immense amounts of data. How can this information be archived and presented in useful form? This is the responsibility of databases.
Modern genomics combines biological data with computer science and statistics. From this union have emerged both an intellectual framework and a technological toolkit. These contribute essential components to research and applications of genomics and related fields. Access to data and software has become as necessary a part of the infrastructure of research as distilled water. Computer storage and software are essential for generating, collecting, archiving, curating, distributing, retrieving, and analysing biological data. Many biologists specialize in computing; all find it an essential resource.
Sources of biological data include several high-throughput streams, including:
• Systematic genome sequencing
• Protein expression patterns
• Metabolic pathways
• Protein interaction patterns and regulatory networks
• The scientific literature, including bibliographical databases. Now that much of the scientific literature is available online, it can itself be the subject of data mining.
Bioinformatics started with clerical projects for archiving and distribution of data. Annotation and even curation of the data were, initially, minimal. Data entry structuring was rudimentary. Many databanks were distributed as a series of flat files – that is, plain text – to provide the lowest common denominator of intelligibility by different computer systems.
Recognition of the importance of access to the data led to policy decisions by journals and funding agencies that obliged scientists to make their sequence and structure data publicly available. This increased the harvest and permitted the combination of fragmentary and incoherent data into logically structured collections. Databanks began to adopt fixed formats and controlled vocabularies, and recognized the importance of careful curation and annotation of the data.
There followed a recognition of the power of information-retrieval tools, which require imposing a structure on the data. These made possible the selective search and retrieval of data needed to answer particular scientific questions. Other methods permit numerical and/or textual analysis. Sequence alignment is by far the most common example.
For the development and provision of these and many other tools, biology is indebted to computer science. Computer science is a young but nevertheless mature field, moving swiftly on the back of devices for high-capacity information storage and speed of calculation. Its background was mathematics and engineering. Its goal has been the development of methods for making the most effective use of the available technology. This involves understanding what determines the effectiveness of methods, devising efficient methods for the problems we want to solve, and production of computer programs to implement these methods.
The growth in the amount of information, together with the novel and unusual constellation of skills required for managing it, has led to the establishment of specialized institutions to organize the work. The earliest databanks, the Protein Sequence Databank, started by Margaret O. Dayhoff at the National Biomedical Research Foundation, Georgetown University, Washington DC, USA; and the Protein Data Bank of macromolecular structures, started by Walter C. Hamilton of Brookhaven National Laboratory, New York, USA, were outgrowths of – and originally
often found themselves uncomfortably competing for funding with – research activities.
With increasing recognition of the importance of databanks and the growing political attractiveness of their goals, several high-profile institutions have been established. Examples include the US National Center for Biotechnology Information (NCBI) at the US National Library of Medicine (NLM) in Bethesda, Maryland, USA, and the European Bioinformatics Institute outstation of the European Molecular Biology Laboratory, in Hinxton, Cambridgeshire, UK.
In addition to archiving and curating, databanks have been active in developing software for information retrieval and analysis. This is an integral part of making the resources in their care available on-line. However, the advantages of vesting responsibility for the archives in monolithic organizations do not preclude multiple routes of access to the data. Colloquially, anyone can design an individual ‘front end’.
A databank without effective modes of access is merely a data graveyard.
Databank evo-devo
Archiving projects originally tended to specialize, matching the nature of the data with the skills of the scientists curating it. Typically, the Protein Data Bank employed crystallographers, whereas genomic databanks attracted sequence specialists. The databanks tended to develop along different lines, dictated in part by the nature of the data.
This independence and divergence has some drawbacks, notably the difficulty of answering the questions of greater subtlety that arise in studying relationships among information contained in separate databanks. For example: for which proteins of known structure involved in diseases of purine biosynthesis in humans are there related proteins in yeast? We are setting conditions on known structure, specified function, detection of relatedness, correlation with disease, and specified species. We need links that facilitate simultaneous access to several databanks.
In principle, the problems would go away if all the databanks merged into one. To some extent this is happening. For example, the umbrella database UniProt integrates the contents, features, and annotation of several individual databases of protein families, domains, and functional sites, and contains links to others.
Alternatively, databanks are taking advantage of the World Wide Web to include a dense network of links among different archives. Today, the quality of a database depends not only on the information it contains but also on the effectiveness of its links to other sources of information. The growing importance of simultaneous access to databanks has led to research in databank interactivity – how can databanks ‘talk to one another’ without sacrificing the freedom of each one to structure its own data in appropriate ways?
Specialized user communities may extract subsets of the data, or recombine data from different sources, to provide specialized avenues of access. Such ‘boutique’ databases depend on the primary archives as sources of information, but redesign the organization and presentation. Indeed, different derived databases can ‘slice and dice’ the same information in different ways. Similarly, an individual may improve a method for solving an important problem – for example, identification of genes within genome sequences – and make the method available via a web site.
A reasonable extrapolation suggests the idea of specialized ‘virtual databases’, grounded in the archives but providing individual scope and function, tailored to the needs and achievements of individual research groups or even individual scientists.
Genome browsers
There are many databases in the field of molecular biology. We shall survey them in Chapter 3. However, a particular species of database that deals with full-genome sequences and related information is a genome browser.
Genome browsers are projects designed to organize and annotate genome information, and to present it via web pages together with links to related data, such as evolutionary relationships or correlations with disease. Genome browsers are like encyclopaedias. There is a commitment to linking the actual sequence data to as many as possible of the resources of data about an organism. Two major genome browsers are Ensembl, a joint project of the Sanger Centre and the European Bioinformatics Institute in Hinxton, UK, and the University of California at Santa Cruz Genome Browser in the USA.
Figure 1.9 Globin cluster, from the genome viewer of the US National Center for Biotechnology Information (NCBI).
Figure 1.10 Distribution of protein-coding genes and pseudogenes in the α-globin cluster on chromosome 16, and the β-globin cluster on chromosome 11. Chromosome 16, α-globin gene cluster: ζ ψζ αD ψα1 α2 α1 θ1. Chromosome 11, β-globin gene cluster: ε Gγ Aγ ψβ δ β.
expression passes between genes in order of their position on the chromosome.
Focusing down still more finely within the α-gene cluster, we can see the structure of the gene for HBA, the α-subunit of adult haemoglobin (see Figure 1.12). Like most other eukaryotic genes, it is divided into exons (expressed segments of the gene) and introns (intervening regions) (Figure 1.13).
Now we have reached the level of the sequence itself (see Figures 1.14 and 1.15). The protein folds
Figure 1.11 Haemoglobin genes and pseudogenes are distributed on their chromosomes in a way that appears to reflect their evolution via duplication and divergence. That is, adjacent genes are similar in sequence. The evolutionary tree can be drawn without any intersecting lines. Diagram labels: Gene duplication; Divergence (α β); Translocation (α β); Further duplications and divergence (ζ α; ε γ β); ζ ψζ ζD ψα1 α2 α1 θ1; ε Gγ Aγ ψβ δ β.
Figure 1.12 The structure of the gene for the α-subunit of human haemoglobin.
Figure 1.13 Structure of the human α1-globin gene (HBA): 3′-untranslated region in red, exons in green, introns in black, and 5′-untranslated region in cyan. This exon/intron pattern is conserved in many expressed vertebrate globin genes, including haemoglobin α and β chains, and myoglobin. In contrast, the genes for plant globins have an additional intron, genes for Paramecium globins one fewer intron, and genes for insect globins contain none. The gene for human neuroglobin, a homologue expressed at low levels in the brain, contains three introns, like plant globin genes.
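The division of a gene into exons and introns can be made concrete with a short sketch. The Python below assumes a toy gene with three exons and two introns, which is the pattern the caption implies for vertebrate globin genes. The sequence and coordinates are invented, not taken from HBA. Exon coordinates are assumed to be 0-based and end-exclusive, and both function names are hypothetical.

```python
def splice(gene, exons):
    """Concatenate exon segments; `exons` holds 0-based, end-exclusive (start, end) pairs."""
    return "".join(gene[start:end] for start, end in sorted(exons))


def intron_spans(exons):
    """Return the gaps between consecutive exons as (start, end) pairs."""
    ordered = sorted(exons)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
            if next_start > prev_end]


# Invented toy gene: exons in upper case, introns in lower case.
TOY_GENE = "ATGGCC" + "gtaagt" + "AAATTT" + "gtcag" + "GGGTAA"
TOY_EXONS = [(0, 6), (12, 18), (23, 29)]

if __name__ == "__main__":
    print(splice(TOY_GENE, TOY_EXONS))   # expressed segments joined together
    print(intron_spans(TOY_EXONS))       # intervening regions removed by splicing
```

A gene with no introns, as the caption reports for insect globins, corresponds to a single exon pair, for which `intron_spans` returns an empty list.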
Protein evolution: divergence of sequences and structures within and between species
Different globins diverged from a common ancestor
Differences in gene sequences, differences in the corresponding amino acid sequences, and differences in three-dimensional structure reflect evolutionary divergence. The globins within the α and β clusters are more closely related than members of the α cluster are to members of the β cluster. Other globins in the human genome – myoglobin, neuroglobin, and cytoglobin – are more distant relatives. Corresponding globins in related species, such as human and horse, have also diverged. In general, the divergence at the molecular level parallels the divergence of the species according to classical taxonomic methods. But the power of comparative genomics and proteomics in tracing precise relationships both within and among species is immense.
The basic tool for investigating sequence divergence is the multiple sequence alignment (see Figure 1.17).
Figure 1.17 (a) Multiple sequence alignment of five mammalian globins: sperm whale myoglobin, and the α and β chains of human
and horse haemoglobin. Each sequence contains approximately 150 residues. In the line below the tabulation, upper-case letters
indicate residues that are conserved in all five sequences, and lower-case letters indicate residues that are conserved in all but sperm
whale myoglobin. (b) Multiple sequence alignment of full-length globins from eukaryotes and prokaryotes. Many fewer positions are
conserved than in the mammals-only case. In the line below this tabulation, upper-case letters indicate residues that are conserved
in all eight sequences, and lower-case letters indicate residues that are conserved in all but the bacterial globin.
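The convention used in the line below each tabulation can be sketched in a few lines of Python: an upper-case letter where every sequence agrees, and a lower-case letter where all sequences agree except one designated outlier (sperm whale myoglobin in part (a), the bacterial globin in part (b)). This is a minimal sketch under stated assumptions. The function name is hypothetical, the toy sequences are invented rather than taken from the figure, and treating gap characters as not conserved is an assumption, not something the caption specifies.

```python
def conservation_line(aligned, outlier):
    """Summarize an alignment column by column.

    aligned: equal-length strings, one per sequence ('-' marks a gap).
    outlier: index of the one sequence that is allowed to differ.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    marks = []
    for column in zip(*aligned):
        rest = [res for i, res in enumerate(column) if i != outlier]
        if "-" in rest or len(set(rest)) != 1:
            marks.append(" ")                # not conserved
        elif column[outlier] == rest[0]:
            marks.append(rest[0].upper())    # conserved in all sequences
        else:
            marks.append(rest[0].lower())    # conserved in all but the outlier
    return "".join(marks)


# Invented toy alignment; the first sequence plays the role of the outlier.
TOY_ALIGNMENT = ["MLSDGE", "VLSPAD", "VLSAAD", "VLTAAD"]

if __name__ == "__main__":
    for seq in TOY_ALIGNMENT:
        print(seq)
    print(conservation_line(TOY_ALIGNMENT, outlier=0))
```

Adding more distantly related sequences can only remove marks from this line, never add them, which is why many fewer positions are conserved in part (b) than in the mammals-only case.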
Ethical, legal, and social issues
Knowledge creates power. Power requires control. Control requires decisions.
Advances in genomics have created problems that individuals and societies must face. In setting up the Human Genome Project, the US Department of Energy and NIH recognized the importance of ethical, legal, and social issues by allocating 3–5% of the funding to them. We shall discuss these topics throughout the book, in context, keeping several general categories in mind.
DNA databases containing information about every citizen of a country are technically feasible. Routine testing of all newborns for genetic diseases is common, and it would be simple to determine the sequences of the regions used by law-enforcement agencies for identification. Fairly soon it will be relatively inexpensive to determine complete genome sequences of individuals.
Should this information be determined? If it is determined, who should have access to it?
Provided DNA sequence information is kept as private as normal medical records, sequencing can benefit individuals.
More controversial questions arise in allowing the information to be collected into a generally accessible databank. Most people would have no problem accepting an effort that would assist in the capturing of criminals, especially those criminals that are likely to repeat their offences. Most people would have no problem accepting an effort that would make it easier to identify victims of death on a battlefield or after a terrorist attack. Most people recognize that extensive data on genome sequences, in a form that can be correlated with clinical records, would be an extremely valuable source for research. Questions have been raised, however, over:
• Privacy issues: should inclusion in a databank of genomic information about individuals require the individuals’ consent?
• What data should be included? Should the data be limited to the minimum required for standard identification procedures, or be more extensive? (For instance, should sufficient additional data be kept to identify physical features or ethnic characteristics?)
• Access: who should have access to the information?
Databases containing human DNA sequence information
There are two major national repositories of human genome information in the UK (see Box 1.8).
• The National DNA Database (NDNAD) primarily supports law enforcement agencies.
• The UK BioBank has the goal of improving prevention, diagnosis, and treatment of illness. It has amassed a large collection of clinical data, and biological samples to provide sequence information.
Also based in the UK are the comprehensive nucleotide sequence databanks at the European Bioinformatics Institute and The Sanger Centre.
In England and Wales, police have been allowed to take samples from any individual arrested on suspicion of all but the most minor crimes, and to retain the derived information and the samples even if the person arrested is never charged. Consent is not required. Access to the database is not strictly limited to police, but has been used to support research projects without consent of the individuals represented in the database.
In the current political climate, there is intense pressure to tip the scales towards giving governments powers that can be used to protect their citizens. The dangers are that such powers, although they appear to be innocuous to innocent people, may be compatible with abuse if safeguards are inadequate, or – in the worst case – deliberately violated. Even in the United Kingdom, a House of Commons committee in 2005* described as ‘extremely regrettable’ the fact ‘that for most of the time that the NDNAD has been in existence there has been no formal ethical review of applications to use the database and the associated samples for research purposes’. If the situation in the UK could have been described as ‘extremely regrettable’, one need not be a Colonel Blimp to shudder to think what things are like elsewhere (see Box 1.8).
It has been suggested that the NDNAD be extended to the entire population. The current contents of the database show ethnic and gender biases, which a universal database would eliminate. A higher fraction of reported crimes might be solved, but there is no consensus about how significant this would be. Arguments against such an extension include what is at the moment still a very high cost. There is also intense debate over whether the benefits to society would justify the loss of individual privacy.
An important point about privacy of computer databanks is that, even if there is consensus that certain data should be kept confidential, computer security is simply not up to the task of ensuring that it will be.
The UK National DNA Database (NDNAD) is one of the largest forensic DNA collections in the world. It stores both a database of sequence-derived information, and the biological samples from which the sequences were determined. In mid-2010 it contained profiles of an estimated 5.4 million individuals. Of these, 78.50% are male and 20.82% female; a small number are unassigned. The NDNAD contains personal identification ‘DNA fingerprints’, and gender. The probability that sequences from a sample collected at a crime scene will match an entry in the database is over 50%!
Entries in the NDNAD are of two types:
(a) Samples from known individuals. Police can collect samples, without consent, from anyone suspected of a ‘recordable’ offence, not only before trial and possible conviction, but even before being charged with a crime. In the UK, the list of recordable offences is steadily growing through incremental legislation. Roughly speaking, recordable offences include all but the most trivial antisocial actions. For example, under the Football (Offences) Act of 1991, ‘unlawfully going onto the playing area’ is a recordable offence.
For purposes of elimination, the NDNAD also contains samples from persons present at a crime scene, and from police officers.
(b) Samples collected at a crime scene. They may originate from the perpetrator of a crime, from a victim, or even from someone who had been at the scene at some time other than during the commission of the crime. In the event of a match, these become samples from known individuals. (In the case of a partial match, they may suggest that the crime scene sample came from a near relative of the individual matched.)
* House of Commons Science and Technology Committee, Forensic Science on Trial, 29 March 2005. http://www.publications.parliament.uk/pa/cm200405/cmselect/cmsctech/96/96i.pdf
In the UK, the governing legislation has been in an active state, particularly with respect to retention of samples; and, under that heading, particularly with respect to retention of samples from people not convicted of a crime. The Criminal Justice Act of 2003 authorized the widespread collection of samples for DNA profiles in England and Wales, without consent. (This was extended to Northern Ireland in the next year.) The act also allowed the indefinite retention of the information, even if the suspect were never charged with a crime. In 2008, the Counter-Terrorism Act extended the criteria that allow police to demand samples for DNA sequencing.
The law in Scotland, in this as in many other respects, varies from that of England and Wales. Scotland maintains a separate DNA database, and shares results with England. One salient legal difference is that Scotland does not permit automatic indefinite retention of samples from people not convicted of a crime.
The European Court of Human Rights has aligned itself more closely with the Scottish law.
A case against the UK government was brought to the European Court by two individuals who wanted their identifying information expunged from the NDNAD and their samples destroyed. One was tried for attempted robbery but acquitted, the other was never charged. The court held that there had been violation of Article 8 of the European Convention for the Protection of Human Rights and Fundamental Freedoms:
‘In conclusion, the Court finds that the blanket and indiscriminate nature of the powers of retention of the fingerprints, cellular samples and DNA profiles of persons suspected but not convicted of offences, as applied in the case of the present applicants, fails to strike a fair balance between the competing public and private interests and that the respondent State has overstepped any acceptable margin of appreciation in this regard. Accordingly, the retention at issue constitutes a disproportionate interference with the applicants’ right to respect for private life and cannot be regarded as necessary in a democratic society’.
In the US, in the absence of overriding Federal legislation, laws of individual states govern DNA collection. The variety of guidelines for collection and retention of samples expressed in state laws has had a patchy career in the (state) courts. Some have been declared unconstitutional.
In the US, the analogue of the UK NDNAD is the Combined DNA Index System and the National DNA Index System (CODIS/NDIS), maintained by the Federal Bureau of Investigation (FBI). In late 2010, NDIS contained almost 10 million DNA profiles. This is about twice the size of the corresponding UK NDNAD, but represents a smaller percentage of the national population.
The Genetic Information Nondiscrimination Act (GINA) of 2008 aimed at protecting individual privacy with respect to genomic information. It prohibits:
(a) Health insurance companies from reducing coverage or increasing prices to individuals based on information from genetic tests.
(b) Employers from making hiring decisions based on DNA sequence information.
(c) Companies from demanding or even requesting a genetic test.
However, the law does not apply to people applying for life insurance or long-term care or disability insurance.
The Health Care and Education Reconciliation Act became US Federal law in March 2010. In theory, it should allay some of the fears associated with the potential consequences of improper release of genetic information.
In principle, use of DNA samples in research should require consent of the individuals donating the material: consent not only for its collection, but for the specific uses to which the samples will be put. There have been cases of ‘research goal creep’ in which permission was granted for analysis of restricted scope, but the samples were subsequently used for other studies.
A US case testing the propriety of use of samples collected voluntarily from subjects was settled in April 2010. A Native American group, the Havasupai, who inhabit an inaccessible area of the Grand Canyon, has among the highest known incidences of Type II diabetes. In 1991, 55% of Havasupai women and 38% of Havasupai men were affected. Scientists from Arizona State University collected samples from the group, and carried out research projects, believed – by the subjects – to be focused on susceptibility to diabetes. However, using the same samples, the scientists also investigated genetic susceptibility for schizophrenia, and evidence for migration patterns. The subjects objected. Researchers pointed to the wording in the consent form; representatives of the Havasupai alleged (among other things) that the wording was too vague to constitute truly informed consent. The ensuing lawsuit was settled, with the Arizona state agency responsible for the university agreeing to pay the Havasupai US$7 000 000, and to return the samples.
The conclusion is that, in the UK and US at least, legislation is moving in the direction of greater protection of
Recommended reading 35
privacy of DNA sequence information. Belief in this protec- national sharing of identification information, individuals
tion, which may in some respects be illusory, may lead to need to be concerned not with the countries with the most
increased genetic testing, both in regular medical practice, secure databases, but those with the least secure ones.
and by private companies. (a) Like mailing lists, testing (c) Experience has shown that much private information
companies may have the right to sell genetic information in fact becomes disseminated, either through accident
to outside parties. (b) Given the increased degree of inter- or design.
Genomic databases are useful in identifying individuals tain such databases, there is a tension between the desire
from samples collected at crime scenes. In drafting the laws to protect society against offenders and upholding indi-
granting law-enforcement agencies authorization to main- viduals’ rights of privacy.
● RECOMMENDED READING
• The first two books are sources for the history of the development of molecular biology after
the Second World War. Rosenfield, Ziff, & Van Loon’s is a less formal but not less serious
account of the founding people and events.
Judson, H.F. (1980). The Eighth Day of Creation: Makers of the Revolution in Biology. Jonathan
Cape, London.
de Chadarevian, S. (2002). Designs for Life/Molecular Biology after World War II. Cambridge
University Press, Cambridge.
Rosenfield, I., Ziff, E.B., & Van Loon, B. (1983). DNA for Beginners. Writers & Readers
Publishing, London.
• The next five publications deal with genomics and the Human Genome Project, placing it in
broad context. The book by Sulston & Ferry is a personal account by one of the major players.
Ridley, M. (1999). Genome: The Autobiography of a Species in 23 Chapters. HarperCollins
Publishers, New York.
Lander, E.S. & Weinberg, R.A. (2000). Genomics: journey to the center of biology. Science 287,
1777–1782.
Wolfsberg, T.G., Wetterstrand, K.A., Guyer, M.S., Collins, F.S., & Baxevanis, A.D. (2002). A user’s
guide to the human genome. Nat. Genet. 32 (Suppl.), 1–79.
Sulston, J. & Ferry, G. (2002). The Common Thread: A Story of Science, Politics, Ethics and the
Human Genome. Bantam Press, London.
Choudhuri, S. (2003). The path from nuclein to human genome: a brief history of DNA with a
note on human genome sequencing and its impact on future research in biology. Bull. Sci.
Technol. Soc. 23, 360–367.
The 18 February 2011 issue of Science magazine contains several articles recognizing the tenth
anniversary of the sequencing of the human genome.
Exercises
Exercise 1.1 Make a very rough estimate of the average density of protein-coding genes in the
human genome. (Total genome size ∼3 × 10^9 bp; total number of genes ∼3 × 10^4 genes.)
Exercise 1.2 Assume that most eukaryotes have approximately 25 000 protein-coding genes,
and that the average eukaryotic protein has a length of 300 amino acids. Assume that other
functional regions, including RNA-coding genes and control regions, do not require more base
pairs than the protein-coding genes themselves (an assumption for which there exists not the
slightest justification other than ignorance). Estimate the minimum size that a eukaryotic genome
could potentially have.
Exercise 1.3 For the standard genetic code (Box 1.1), give an example of a pair of codons related
by a synonymous single-site substitution (a) at the third position and (b) at the first position.
(c) Give an example of a pair of codons related by a non-synonymous single-site substitution
at the third position. (d) Can a change in the second position of a codon ever produce a
synonymous mutation?
Exercise 1.4 (a) Is it possible to convert Phe to Tyr by a single base change? If so, what would be
possible wild-type and mutant codons? (b) Is it possible to convert Ser to Arg by a single base
change? If so, what would be possible wild-type and mutant codons? (c) What is the minimum
number of base substitutions that would convert Cys to Glu? (d) In the evolution of an essential
protein encoded by a single gene, a Trp is converted to a Gln by two successive single-base
changes. What is the intermediate codon?
Exercise 1.5 RNA editing in the mitochondria of higher plants often changes cytosine to uracil at
the second position of codons. What amino acid changes could this effect? Note that higher plant
mitochondria use the standard genetic code.
Exercise 1.6 A single base-pair deletion in an exon in a protein-coding gene would be very serious
because it would throw off the reading frame. What would you expect to be the effect of a single
base-pair deletion in a gene for a structural RNA molecule? (Consider transfer RNA, for example;
see Figure 1.4.)
Exercise 1.7 Here is a fragment of length 200 from a mitochondrial genome. If this fragment were
processed by a sequencer giving read length 35, what would be (a) the result of a single-end
read?, (b) the result of a paired-end read? (See Figure 1.7.)
Exercise 1.8 Which of the following bands of human chromosome 16 are gene-rich: p13.3,
q22.1, q11.2? (See Figure 1.9.)
Exercise 1.9 On a photocopy of Figure 1.9, mark the approximate position of the α-globin gene
cluster.
Exercise 1.10 From Figure 1.10, estimate the number of genes per base pair in the globin region.
Exercise 1.11 On a photocopy of Figure 1.14, mark with highlighters the regions shown in
Figure 1.13, in the same colours as in Figure 1.13.
Exercise 1.12 On a photocopy of the amino acid sequence of the human haemoglobin α chain in
Figure 1.15(b), indicate the regions arising from the three exons in the gene.
Exercise 1.13 What are the symmetries of the structure of the haemoglobin molecule? (See
Figure 1.16.) If the α and β subunits were identical, what additional symmetries would there be?
Exercise 1.14 On a photocopy of the sequence of the Tn3 transposon of E. coli (Box 1.6), indicate
the extents of the inverted terminal repeats.
Exercise 1.15 How might genome sequencing be applied to the conservation of endangered
species?
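For the codon exercises above (1.3 to 1.5), it can help to enumerate single-base substitutions mechanically instead of scanning the table by eye. The following is a minimal Python sketch, not part of the original text: the names CODE and single_site_neighbours are hypothetical, and the table is assumed to be the standard genetic code written for mRNA codons, with '*' marking stop codons.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code as one-letter amino acid codes ('*' = stop),
# listed in the order produced by iterating U, C, A, G at each codon position.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_site_neighbours(codon, position):
    """Yield (mutant codon, amino acid, is_synonymous) for every
    substitution at one position (0, 1 or 2) of the given codon."""
    original = CODE[codon]
    for base in BASES:
        if base == codon[position]:
            continue  # skip the unchanged base
        mutant = codon[:position] + base + codon[position + 1:]
        yield mutant, CODE[mutant], CODE[mutant] == original


if __name__ == "__main__":
    # Example: all substitutions at the third position of UUU (Phe).
    for mutant, aa, same in single_site_neighbours("UUU", 2):
        print(mutant, aa, "synonymous" if same else "non-synonymous")
```

Calling the function for each of the three positions of a codon lists every amino acid reachable by one base change, which is the question the exercises pose in several forms.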
Problems
Problem 1.1 Draw a rough sketch of the outline of a small eukaryotic plant cell (representing
10 000 nm diameter) to fill almost a whole page. Within this outline, draw in, at scale, the
cellular components listed in the table. At the same scale, draw an E. coli cell (approximate
diameter 2000 nm).
Problem 1.2 From the following data, compute the number of genes per Mb for mitochondria,
rickettsia, chloroplasts, and cyanobacteria. Compare the mitochondria with the rickettsia and
the chloroplasts with the cyanobacteria. Is the smaller genome size of the organelles largely the
result of eliminating non-functional DNA, or loss of genes, some of which were transferred to
the nuclear genome?
Problem 1.3 It is estimated that the human immune system synthesizes 10^8–10^10 antibodies.
The portion of a typical antibody active in binding antigen consists of two variable domains each
containing 100 amino acids. If every variable domain were separately encoded in the genome but
could pair promiscuously, then the number of variable domains that needed to be encoded would
be of order of magnitude the square root of the total number of antibodies. How many bases
would be required to encode separately all the variable domains? Compare this with the size of
the human genome.
Problem 1.4 Human haemoglobin α-subunits have helices A, B, C, E, F′, F, G, and H. Human
haemoglobin β-subunits have helices A, B, C, D, E, F′, F, G, and H. The sites of interaction of the
histidine residues with the iron in the haem group are in the F and G helices. In Figure 1.16, in
what colours do the α-subunits appear and in what colours do the β-subunits appear?
Problem 1.5 Outline how you would search for tRNA genes in a genome sequence. Assume that
the lengths of the double-helical regions and the lengths of the single-stranded regions between
them vary within relatively tight limits and that there are only a limited number of deviations from
perfect base pairing in the helical regions. How would your method have to be modified if tRNA
genes contained introns?
Problem 1.6 In Figure 1.17(a), (a) find five positions in which the amino acid is the same in all
haemoglobin chains but different in myoglobin. (b) Find four positions at which human and horse
α chains have the same amino acid, human and horse β chains have the same amino acid, and
myoglobin has a different amino acid. (c) Find two positions at which myoglobin and human
and horse β chains have the same amino acid, but the α chains have a different one. (d) Classify
the following sequence. Is it a haemoglobin α chain, a haemoglobin β chain, or a myoglobin?
MGLSDAEWQLVLNVWGKVEADIPGHGQDVLIRLFKGHPETLEKFDRFKHLKTEDEMKASE
DLKKHGTTVLTALGGILKKKGQHEAEIQPLAQSHATKHKIPVKYLEFISEAIIQVIQSKH
SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG
Problem 1.7 The annual birth rate in the UK is approximately 700 000. Two ‘back-of-the-envelope
calculations’: (a) If the goal of a US$1000 genome is attained, what would be the total cost of
sequencing 700 000 babies? The current annual budget of the UK National Health Service is
£120 000 000 000. What percentage of the current NHS budget would be required to sequence
the genomes of all babies born in the UK? (Whether the money would be available, and, if it
were, whether sequencing would be the best use of it, are separate questions.) (b) If the sequence
data were stored completely and independently, by what percentage would they increase the
current holdings of the Nucleotide Sequence Databanks?
Problem 1.8 Suppose that you knew your personal DNA fingerprint in the format stored in
national DNA databases. Assume that the information did not reveal that you were unusually
susceptible to any known disease. What are the arguments for and against your voluntarily
depositing these data in the national DNA database of the country of your residence? What
conditions would you want to impose on the use of the data?
Problem 1.9 Outline the arguments for and against the use of animals for testing in drug
development. What restrictions might be imposed that would mitigate any of the arguments
against use of animals in testing? How and to what extent would the effectiveness of drug
development suffer if these restrictions came into force?
Weblems
Weblem 1.1 At the time of preparing the manuscript of this book, the sequencing capacity of the
largest dedicated genomics institute was estimated as 3.2 × 10^12 bp per day. This corresponds to
1000 human genome equivalents. What is the corresponding current maximal sequencing
capacity?
Weblem 1.2 For each of the following traits of any human being, estimate the contributions of
(1) sequences of particular genes, (2) life history and environment, (3) epigenetic factors: (a) blood
type; (b) adult height; (c) native language; (d) whether or not a person will develop Huntington’s
disease; (e) whether or not a person will develop emphysema; (f) whether or not a person will
develop Angelman syndrome.
Weblem 1.3 Which human chromosomes contain genes for ubiquitin?
Weblem 1.4 The gene for insulin is 1.7 kb long; the LDL receptor gene is 5.45 kb; and the
dystrophin gene is 2400 kb. (a) What are the lengths of the amino acid sequences of these
proteins? (b) What are the ratios of total exon length/total gene length and total intron length/
total gene length for these three genes?
CHAPTER 2
LEARNING GOALS
• Recognizing that genomics has transformed our approaches to all the classical topics of biology
and medicine.
• Appreciating that despite the individual variation among genomes within and among
populations of humans, other animals, and plants, the idea that species are discrete entities
is still valid. For prokaryotes, the situation is somewhat murkier.
• Adding to Crick’s central dogma – DNA makes RNA makes protein – recognition of the
necessity for regulating transcription, translation, and protein activity, for the health of cells
and organisms.
• Distinguishing the static components of genomes – the full nucleotide sequences that appear
in databases – from the dynamic aspects involved in the responsiveness of cells to internal and
external signals, and in the programmes of development seen in higher organisms.
• Knowing the different types of control mechanisms in organisms, and their points of application,
including but not limited to regulation of activities of proteins and control over gene expression.
• Realizing the importance of comparative genomics in revealing conserved elements that are
likely to have interesting functions.
• Knowing the different mechanisms by which mutations can affect human health.
• Being familiar with a number of important diseases with genetic components, and knowing
which ones can be treated, and for which ones lifestyle adjustments can reduce the danger
inherent in genetic risk factors.
• Recognizing that the roster of species has been in flux. Mass extinctions have characterized the
history of life. We are currently in another period of mass extinction.
2 Genomes are the Hub of Biology
A genome sequence belongs to an individual organism. But, if the goal is to ‘see life clearly and
to see it whole’ it is essential to relate genomes to one another. Genome comparisons are
necessary across both space and time. We analyse differences in genomes within and among species
that are currently extant, or that are accessible from sequencing preserved specimens of extinct
species. Within species, we study genetic variation within and between populations. In humans,
the intraspecific variation reveals the history of our origin and dispersal, and how we have
adapted to different environments and lifestyles. An example is the development of adult lactose
tolerance, a concomitant of the dietary change associated with cattle domestication. The original
trait, loss of the ability to digest lactose after infancy – at the time of weaning – persists in
many populations. For these populations, alternative sources of calcium include fermented dairy
products such as cheese or yoghurt – bacteria hydrolyse the lactose – or soybean. (It is said
that ‘The soybean is the cow of Asia’.)
Comparing genomes between species can reveal their relationships and how they achieve their
similarities and differences. We share 96% of our genomes with our nearest relatives, the
chimpanzees, and many of our proteins are identical. This small amount of genomic change must
account for the differences between humans and chimps. Indeed, genomics is essential to
understanding and drawing species boundaries. As useful a concept as a species may be, there is
consensus that its definition is a very tricky problem.
Genomes also contain records of their evolutionary history, which we can try to reconstruct. The
organisms that populate the Earth are the product not only of the response to an imposed
geological environment, but very largely show effects of interactions among individuals and
species. There are many obvious examples of competition among members of the same species, and
attack and defence between species. Our struggle against pathogenic viruses and bacteria is a
salient example. Moreover, in the longer term the geological environment has in crucial respects
been imposed by life. The early organisms that released oxygen through photosynthesis changed the
composition of the atmosphere. This made possible the development of aerobic metabolism. The
Earth has been the theatre of an enormous network of such interactions.
A feature of the history of life is the origin of new species and the extinction of many others.
In some cases, extinctions arise from external events – for instance, an asteroid impact. In
others, extinction is the result of deliberate activity – sailors killing the dodos of Mauritius,
or twentieth-century scientists exterminating the smallpox virus in the wild. Today, habitat
destruction threatens many other species, often an inadvertent result of human activity. To some
threatened extinctions we can however plead not guilty: examples include the mollicutes infection
(‘elm yellows’) which is devastating the elm trees of the eastern United States and southern
Ontario (Figure 2.1), the fungal ‘white-nose disease’ threatening bats in eastern North America,
and devil facial tumour disease, a transmissible cancer in Tasmanian devils (Sarcophilus
harrisii), believed to have originated as a single mutation in a single individual.
Figure 2.1 Elm tree infected with elm yellows.
Pennsylvania Department of Conservation and Natural Resources – Forestry Archive, Bugwood.org
We can study the present as thoroughly as we wish, and the past as extensively as we can. What of
the future? One thing in which we can have confidence is that molecular biology will give us
greater control over living things, including but not limited to ourselves. Clinical
applications, of genomics and of other fields, have the potential to enhance the health of
humans, and of animals and plants. It is already possible to intercede against some genetic
diseases, by replacement of dysfunctional proteins through direct administration; for example,
insulin therapy for diabetes and blood clotting factors for haemophilia. At the forefront of
clinical developments are methods of rectification of mutant genes, through gene delivery by
viruses (X-linked adrenoleukodystrophy), or by introduction of functional genes through stem
cells (α1-antitrypsin deficiency).
Genomes are now central to all of academic biology and clinical applications. In this chapter we
shall emphasize the integrality of the relationships between genomes and other, formerly
independent, areas of science. We shall begin with a cell, and then widen our point of view
successively to organisms, populations, species, and the biosphere as a whole.
In 1958, F. Crick proposed the ‘central dogma’ of molecular biology:
DNA makes RNA makes Protein.
Like many other dogmas, it provides important insights, even if it is not the whole story.
A less concise statement of the dogma would be: DNA sequences govern the synthesis of nucleotide
sequences of RNA (transcription), which in turn govern the synthesis of amino acid sequences of
proteins (translation). In his formulation, Crick emphasized the one-way flow of information:
that amino acid sequences do not dictate the synthesis of nucleotide sequences of RNAs – this is
still true – and that RNA sequences are not transcribed into DNA sequences. We now know that some
viruses can ‘reverse transcribe’ RNA sequences into DNA.
Also sacrificed to concision in Crick’s statement of the dogma is the question of regulation and
control. In a healthy cell, the traffic through the dense network of metabolic pathways is
coordinated, so that no intermediates build up to unwanted levels, and adequate amounts of
products are created when and where they are needed. This involves regulation both of the
activities of molecules already present in the cell, and control of synthesis of proteins and
RNAs. An example of a mechanism of control over the activity of enzymes present in the cell would
be feedback inhibition, in which the product of a concatenation of metabolic reactions inhibits
an enzyme catalysing a step early in the sequence. Many other mechanisms exist to regulate the
concentrations of enzymes and other proteins, by control over levels of transcription of the
genes that encode them.
Thus cells contain two parallel networks: an enzymatic network manipulating metabolites, and a
regulatory network manipulating genes, transcripts, and proteins.
• The central dogma – DNA makes RNA makes protein – describes not only a series of molecular
events but a direction of information transfer. Information sensing and response are also
essential for integration and coordination of the activities and expression patterns of many
molecules. In this way, cells can achieve stability and robustness.
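The two information-transfer steps named in the central dogma, transcription and translation, can be mimicked on sequence strings. The following is a minimal Python sketch, not taken from the original text. It assumes the standard genetic code, takes the DNA coding (sense) strand as input, starts translation at the first AUG, and ignores all regulation, splicing, and untranslated regions. The names CODON_TABLE, transcribe, and translate are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, '*' = stop codon.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated in this toy model
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGGTAA")
    print(mrna, translate(mrna))
```

Note that the functions run in one direction only: nothing in the sketch maps a protein back to a unique nucleotide sequence, which mirrors the one-way flow of information that Crick emphasized.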
Expression patterns
The ∼23 000 protein-coding genes in the human genome, exclusive of the immune system, can give
rise to at least as many proteins – many more, if splice variants are taken into account.
However, a healthy cell must synthesize only those proteins necessary for the requirements of its
physiological state, and its differentiated type. Regulation of expression is common to all
cellular life forms and even to some viruses. In prokaryotes, for instance, Escherichia coli will
show robust expression of the genes in the lac operon only if growing on a medium containing
lactose. Human cells in different organs will synthesize proteins appropriate for their cell
types.
Not only must each cell choose which proteins to synthesize; it must control the rate of
production of each. A variety of mechanisms achieve this control.
Some involve the binding of proteins to specific DNA sequences, to control protein synthesis at
the level of transcription.
• The genome sets the parameters that circumscribe a potential life. Some constraints are tight,
others relatively loose. Conversely, life implements the genome, in the form of synthesis of RNAs
and proteins, and in the regulatory mechanisms that keep cellular activities stable and robust.
Regulation of gene expression
Living things must regulate the synthesis of proteins encoded in their genomes. Transcription and
translation must be dynamic – to produce the right amount of the right protein at the right time
at the right place. In this way, cells can respond to stimuli by altering their physiological
state, or even their physical form. The driving force for these changes in profile of proteins
produced may be changes in the environment, or internal signals directing different stages of the
cell cycle or developmental programmes. We have mentioned that the appearance of lactose in the
medium can trigger transcription of the lactose operon in E. coli. Transcription of another
operon, encoding enzymes for the biosynthesis of the amino acid tryptophan, will be repressed if
tryptophan is present in adequate concentrations. Similarly, a human cell may differentiate into
a neuron, sprouting dendrites and an axon, and synthesizing tissue-specific or even cell-specific
proteins.
The central dogma of DNA → RNA → protein suggests several possible leverage points for regulation
of protein expression.
Most control takes place at the level of transcription
In prokaryotes, a specific focus of transcriptional regulation is at or near the binding site of
RNA polymerases to DNA, just upstream of (5′ to) the beginning of the gene. Repressors can turn
off transcription by occluding the binding site, blocking polymerase activity. In contrast,
promoters can actively recruit polymerases through cooperative binding, along with polymerase, to
a site on the DNA.
Gene regulation in eukaryotes is more complex. Transcription regulators bind to DNA at positions
proximal to the gene as in prokaryotes, but also at remote sites. The control of human β-globin
expression illustrates such a scheme (see Box 2.1). Regulatory interactions also govern the
expression of other transcription factors. The resulting control networks show far greater
complexity, in both their logic and their dynamics, than those of viruses or prokaryotes.
The gift of complexity is robustness. Eukaryotic control networks show an ability to reprogram
themselves and to respond to stimuli by changing cell state. The source of robustness appears to
be redundancy. Yeast (Saccharomyces cerevisiae) has about 6000 genes. Under ‘normal’ – non-stress
– conditions, about 80% of them are being expressed. It is also true that yeast can survive
approximately 80% of single-gene knockouts. (It would be interesting to know the overlap of these
sets!) Many expressed genes must be redundant, and redundancy provides robustness. (Gene
regulatory networks are the subject of Chapter 11.)
Other mechanisms of transcription regulation in eukaryotes involve changes in patterns of
methylation of DNA associated with changes in the structure of chromatin. Eukaryotic chromosomes
contain complexes of DNA with histones (see Figure 1.2). Chromatin remodelling is an important
mechanism of transcriptional control. Reversible chemical modification of histones, by a
mellifluous variety of reactions including deacetylation, methylation, decarboxylation,
phosphorylation, ubiquitinylation, and sumoylation, leads to alterations of the DNA–histone
interactions that render transcription-initiation sites more or less accessible.
In differentiation, DNA methylation is a regulatory mechanism that survives cell division (see
Box 2.2). Methylation of cytosine in CpG islands silences the adjacent genes, possibly by
stimulating chromatin remodelling. When a cell divides, enzymes copy the methylation patterns,
preserving the settings of the regulatory switches.
• CpG islands are regions of high GC content, rich in the dinucleotide sequence CG, that appear
at the 5′ ends of vertebrate genes. Methylation of C residues silences genes.
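Two quantities are commonly computed when scanning a sequence for candidate CpG islands: the overall GC content, and the ratio of observed to expected CG dinucleotides. The following is a minimal Python sketch, not part of the original text. The function names are hypothetical, and the default thresholds (length at least 200 bp, GC content above 50%, observed/expected ratio above 0.6) are widely quoted rule-of-thumb values that should be treated here as adjustable assumptions, not as a definition given by this chapter.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    # 'CG' cannot overlap itself, so str.count gives the true number of CG dinucleotides.
    return seq.count("CG") * len(seq) / (n_c * n_g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed rule-of-thumb thresholds to a single window."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)


if __name__ == "__main__":
    window = "CG" * 100  # artificial 200-bp window made only of CG dinucleotides
    print(gc_content(window), cpg_obs_exp(window), looks_like_cpg_island(window))
```

In practice such a test would be applied to sliding windows along a genome sequence; the sketch checks one window only, and it says nothing about methylation state, which cannot be read from the bare sequence.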
The protein-coding genes of the globin loci interact with many control regions. The β-globin
region includes promoters proximal to individual genes; a locus control region occupying a region
between 6 and 22 kb upstream of the most 5′ gene (ε); a 250 bp pyrimidine-rich region 5′ to the
δ gene (YR); and enhancer regions, which may appear on the same chromosome, in some cases near to
and in other cases distant from the gene they control, or on entirely different chromosomes.
Control of globin gene expression is exerted for both tissue specificity and developmental
progression. In humans, transcription of the β-globin region switches from the embryonic ε-globin
in the yolk sac to the foetal γ-globins in the liver (Gγ and Aγ) and finally to the adult
β-globin produced in cells derived from bone marrow.
An essential mechanism of control is modulation of local chromatin conformation. Chromosomes
contain nucleoprotein complexes called chromatin. DNA is associated with proteins called histones
(Figure 1.2). Chromatin conformational changes can be induced by covalent modification of
histones and by binding of chromatin-remodelling proteins. Differential sensitivity of sites to
DNAse I digestion measures differences in exposure. Some regions near actively transcribed genes
are hypersensitive to DNAse I digestion (see Figure 2.2a). The locus control region upstream of
the β-globin locus consists of five hypersensitive regions. The degree of their exposure is
correlated with transcriptional activity.
Regulation of β-globin expression involves interaction of the locus control region with proximal
promoters associated with specific genes. The interactions are mediated by a large complex of
proteins recruited to the site. The alternative interactions of the locus control region, with
foetal- and with adult-expressed genes, suggest that expression is determined by a competition
between these interactions (see Figure 2.2b,c).
A well-known enhancer of globin expression is erythropoietin, a glycoprotein hormone encoded on
chromosome 7. Erythropoietin does not interact with sequences in the vicinity of the globin locus
but works indirectly by activating intracellular signalling pathways by binding to a receptor.
Erythropoietin expression is sensitive to oxygen tension. Hypoxia increases erythropoietin
production. This occurs naturally at high altitudes; as a result, people who live at 2500 m above
sea level have about 12% more haemoglobin than people who live at sea level. Athletes take
advantage of this. About 3 months of adaptation time is necessary to build up this differential.
A surprising player in the game of globin expression regulation is acetylcholinesterase, most
famous for its physiological role in neural synapses and neuromuscular junctions.
Acetylcholinesterase also regulates globin synthesis.
[Figure 2.2, panels (a)–(c). Labels: locus control region (DNAse hypersensitive segments);
protein-coding regions ε, Gγ, Aγ, δ, β; Y-rich control region (YR); PYR; scale markers 15 kb and
60 kb.]
Figure 2.2 (a) β-Globin region showing protein-coding genes and locus control region, consisting
of five DNAse hypersensitive segments. Pseudogene ψβ not shown. (b,c) Schematic structural model
for the control of globin gene expression. Part (b) shows the foetal structure, with interaction
between the locus control region and the Gγ and Aγ genes, mediated by proteins (green). Blue
circles indicate chromatin-remodelling complexes. In this configuration, the Gγ and Aγ genes will
be expressed. In part (c), the cyan circle indicates the PYR complex of proteins, which binds to
the pyrimidine-rich (YR; Y stands for pyrimidine) region just 5′ to the δ gene. This binding
reconfigures the system: PYR blocks the foetal mode of interaction of the locus control region
with the γ region (b). Instead, the locus control region interacts with, and promotes the
expression of, the β gene.
Adapted from: Bank, A. (2005). Understanding globin regulation in β-thalassemia: it’s as simple
as α, β, γ, δ. J. Clin. Invest. 115, 1470–1473.
An example of gene silencing by DNA methylation is the formation of Barr bodies in female
mammals. Cells of mammalian females (except for oocytes) have two X chromosomes. The product of
the Xist gene (an RNA molecule) on one of the X chromosomes inactivates that entire chromosome
and causes it to form a compact, transcriptionally inert object, called a Barr body. The Xist
gene on the other X chromosome is inactivated by cytosine methylation, leaving that chromosome
normal in structure and activity. As a result, cells of both males and females have one active
X chromosome. This is the mammalian solution of the ‘dosage compensation’ problem that arises
because the genomes of males and females contain different numbers of copies of X chromosome
genes.
In human females, and females of other placental mammals, each cell chooses at random which
X chromosome to inactivate. Most mammalian females are, therefore, mosaics of cells expressing
genes from alternative X chromosomes. A visible example of this mosaicity is a calico cat,
necessarily a female (see Figure 2.3b). A calico cat has a yellow-coat allele on the X chromosome
inherited from one parent and a black-coat allele on the other X chromosome. (In the white
patches of the coat, neither allele is expressed.) The size of the coloured patches on the coat
reveals when the genes were inactivated. In contrast, in female marsupials, all cells inactivate
the paternally derived X chromosome. (This is an example of the general phenomenon of genetic
imprinting, the dependence of phenotype on the parental origin of a gene.)
A difficulty in cloning of higher animals is how to restore pluripotency to the single cell from
which the animal will develop. Figure 2.3 shows (a) a cloned cat, named Cc (for Copycat), and
(b) its source organism (very loosely, its mother), Rainbow, a calico cat. Although the two cats
are genetically identical as far as the nucleotide sequences of their DNA are concerned, Rainbow
is a mosaic and Cc is not. The cell that produced Cc had an inactivated X chromosome and this
inactivation was replicated in all of Cc’s cells.
Figure 2.3 (a) Cc – or Copy Cat – a cloned cat. (b) Rainbow – a calico cat, the source organism
for Cc. The varied-colour appearance is reminiscent of Indian maize (Figure 1.5), and indeed
there are some similarities in the mechanisms that created them.
Photos reproduced courtesy of The College of Veterinary Medicine & Biomedical Sciences, Texas A&M
University.
Some mechanisms of regulation act at the level of translation
Antisense RNA will form a double helix with mRNA, and block translation. (Antisense RNA is single-stranded
RNA complementary to mRNA.)
Introduction of genes for antisense RNA can silence genes. For example, ethylene is a plant hormone that
stimulates ripening. Genetic modification – a controversial activity in the context of agriculture – has produced
a tomato with longer shelf life. An artificial gene in the ‘Flavr Savr’ tomato is transcribed to an antisense RNA
that greatly reduces translation of a gene involved in ethylene synthesis, delaying ripening.
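The complementarity that defines antisense RNA can be illustrated with a minimal sketch. The function name and the nine-base mRNA fragment are hypothetical, chosen only for illustration; the sketch simply applies Watson–Crick pairing and reverses the strand so that both sequences are written 5′→3′.

```python
# Watson-Crick pairing rules for RNA bases.
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def antisense_rna(mrna: str) -> str:
    """Return the antisense strand, written 5'->3', for an mRNA written 5'->3'."""
    # Reverse the sequence, then complement each base.
    return "".join(RNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


if __name__ == "__main__":
    fragment = "AUGGCUCAA"  # hypothetical mRNA fragment, not from any real gene
    print(antisense_rna(fragment))
```

Applying the function twice returns the original sequence, which is one way to see that the two strands are mutually complementary.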
Expression patterns 47
In RNA interference, a short stretch of double-stranded RNA (∼20 bp) elicits degradation, by a
ribonucleoprotein complex, of mRNA complementary to either of the strands. RNA interference may
have a natural function in defence against viruses. It has been applied in the laboratory to achieve
effective gene knockouts in studies aimed at deducing gene functions.
Another mechanism of translation control is the attachment of ligands to the Shine–Dalgarno sequence
in mRNA, preventing the RNA from binding to the ribosome. In E. coli, vitamins B1 (thiamin) and B12
(adenosylcobalamin) bind to an mRNA containing transcripts of genes encoding proteins involved in
their biosynthesis. This is a kind of feedback inhibition – adequate amounts of a product inhibit the
synthesis of more. However, unlike the more familiar product inhibition of an enzyme, this control is
applied at the level of RNA, rather than protein.
Different modes of transcriptional control vary in their dependence on external conditions and all are
reversible – some more readily than others. The lactose operon control in E. coli is a ‘toggle’ switch that
can respond to both the appearance of lactose and its subsequent exhaustion. Other control processes are
cyclic, such as cell-cycle control and diurnal rhythms. Cellular differentiation in higher eukaryotes is
usually irreversible, except in certain forms of cancer.
Translation of eukaryotic genes may produce several different splice variants. Regulation of translation in
eukaryotes may affect the distribution of splice variants produced. Mechanisms include (1) degradation
exons (or even introns) and interact with the splicing machinery to direct maturation of the mRNA. At
the transcriptional level, chromatin remodelling may render certain exons inaccessible and affect the splice
Cells regulate the activities of their proteins by mechanisms applied at the levels of transcription,
translation, post-translational modification of proteins, and response of proteins to ligation (see Figure 2.4).
Although these processes are biochemically distinct, cells apply them in coordinated ways. Chemical
modifications of transcription factors can quantitatively regulate amounts of proteins in cells, rather than
simply switching transcription on and off. Different control processes are effective over different time
scales: cells must sometimes react quickly, to threat or stress; at other times rhythmically, over cell cycles;
and sometimes in programmes unfolding over years or even decades during development of an organism.
It is not possible to sort the different control mechanisms into fast- and slow-acting categories. There are
layers of complexity here that we are only beginning to understand.
• There are many mechanisms, and types of targets, for regulation. These include control over expression
patterns, and control over the activities of proteins and non-coding RNAs in the cell. For instance, allosteric
changes are ligand-induced conformational changes in proteins that modify activity, often leading to
cooperative binding curves, as in haemoglobin.
[Figure labels: DNA; gene copy number; promoter activity; transcription; repression/attenuation; codon usage,
tRNA levels; translation; ribosome binding; alternative splicing; RNA interference]
Figure 2.7 Alternative social behaviours of C. elegans, feeding on a lawn of agar in a Petri dish. Left: solitary feeding. Right: clumping.
The agar is slightly thicker at the edges of the dish, which is why the worms congregate there.
From: de Bono, M. & Bargmann, C.I. (1998). Natural variation in a neuropeptide Y receptor homolog modifies social behavior and food response in
C. elegans. Cell 94, 679–689.
Models for neurological disease, in natural or transgenic Drosophila and C. elegans, include Parkinson’s
and Huntington’s disease, Friedreich’s ataxia (a degenerative disease of the nervous system), and
early-onset dystonia (neuromuscular dysfunction producing sustained involuntary and repetitive muscle
contractions or abnormal postures). Species more closely related to humans, including zebrafish and
mice, provide models for most mental diseases.
Traditionally, there was a fairly rigid distinction between neurological/physiological/biochemical diseases
affecting cognitive performance, and psychiatric diseases believed to have emotional causes and
effects. However, we now recognize that even if an illness is caused by an unhealthy emotional environment
(and leaving aside the genetic component of susceptibility to disease in response to such an environment),
organic changes – down to the molecular level – underlie the psychological manifestations.
There is now good evidence for a substantial genetic component in numerous psychological conditions,
including dyslexia, autism, schizophrenia, attention-deficit hyperactivity disorder, and others. A
number of conditions appear in both familial forms, showing relatively simple patterns of inheritance,
and sporadic forms, with more complex genetic components and greater influence of environment.
Familial forms often show early onset.
For instance, familial Alzheimer’s disease is a relatively early-onset form of the condition, with early
onset defined in this case as appearing before the age of 65. It affects about 10% of those with Alzheimer’s
disease. The familial form is associated with mutations in genes on chromosomes 1, 14, and 21. In contrast,
a mutation in the gene for ApoE on chromosome 19 causes increased risk of late-onset Alzheimer’s
disease (see pp. 61–62).

Genetics of behaviour
• Different strains of C. elegans show different social behaviour in dining. C. elegans can be grown on
an agar lawn in a Petri dish. The wild type collected from Australia will congregate in groups
to feed. The wild type collected from Britain will eat separately (see Figure 2.7). The difference has
been traced to a single amino acid change in a seven-transmembrane helix protein, NPR-1.

The unity of life is a theme of this book, but perhaps it would be wrong to read too much into this example.

• In fruit flies, T. Tully and co-workers have identified a gene associated with memory. CREB (cyclic
AMP response element-binding protein) encodes a transcription factor, part of a large family of
paralogues in mammals and distributed widely in eukaryotes and prokaryotes. Some engineered
changes in CREB produced flies that could learn but not store memories; other changes produced
flies that learned substantially faster than normal. Alterations in mouse CREB have produced
memory impairment.
• The Lesch–Nyhan syndrome was the first correlation discovered between a specific human genetic
defect and behavioural anomalies. It is an X-linked deficiency of a single protein, the enzyme
hypoxanthine–guanine phosphoribosyltransferase. The consequent inability to metabolize uric acid
properly leads to physical symptoms including gout and kidney stones. But patients also show
poor muscle control, mental retardation, facial grimacing, writhing and repetitive limb movements,
and uncontrollable lip and finger biting to the point of severe self-mutilation.
• Seasonal affective disorder, a mood change in response to prolonged darkness, is related to the
regulation of circadian rhythms. Jet lag is a related condition. The system of genes involved in
circadian rhythms was originally worked out in fruit flies and has homologues in many animals,
including humans and C. elegans, and in plants.
• Mutations in an X-linked human gene, DLG3, cause severe learning disability. Knockout of the mouse
homologue PSD95 produces learning-impaired mice. The protein encoded by PSD95 binds to the
NMDA receptor, a protein involved in synaptic plasticity. Overexpression of NMDA receptor
gives mice superior learning and memory abilities.
• Several genes are implicated in schizophrenia. A common effect, hypersensitivity to the
neurotransmitter dopamine in the brain, plays a role in schizophrenia. Some antipsychotic drugs block
dopamine receptors on neuronal surfaces; conversely, amphetamines, which release dopamine,
aggravate the symptoms.
• Depression is a common response to stress. Although we all suffer stressful episodes in our
lives, the development of debilitating mental disease depends on our alleles in the promoter region
of the gene for the serotonin transporter (5-HTT). Individuals with one or two copies of the shorter
allele of this gene are more likely to exhibit depression and suicidal tendencies than individuals
homozygous for the longer allele.
• Maltreatment in childhood is a cause of antisocial behaviour. Obviously. However, the likelihood
of development of antisocial behaviour depends on the allele for monoamine oxidase A (MAOA).
Maltreated children with a genotype producing high expression levels of MAOA are less likely to
become antisocial or violent offenders. A. Caspi and colleagues found that, although only 12% of a
sample of people had the combination of low-activity MAOA genotype and childhood maltreatment,
these individuals accounted for 44% of subsequent convictions for violent offences.

• Cognitive abilities are our species’ proudest achievement. Neuropsychiatric illness is a correspondingly
difficult problem to understand and treat. Both have genetic components that are beginning to be
understood, in part through study of analogues in other species.
Populations
genome sequence. The target may be to identify one base out of 3 × 10⁹. By correlating phenotype with
haplotype, only enough sequence must be collected to localize the site to within the typical length of a
haplotype block, perhaps ∼100 kb, containing only a few genes.

• Think of boundaries between haplotype blocks as being like the grooves in a bar of chocolate that permit
it to be broken easily into bite-size fragments.

Variations in human genomes are the subject of several large-scale projects.
The SNP Consortium (http://snp.cshl.org) collects human SNPs. Its database currently contains nearly
4.2 million SNPs.
The International HapMap Project collects and curates haplotype distributions from several human
populations. SNPs are its raw material, from which it identifies the correlations among them. Phase I of the
project, published in October 2005, had the goal of measuring the distributions of at least one SNP every
5 kb across the whole human genome. Blood samples were provided by 269 individuals from four
continents (see Box 2.4). Over 1 million SNPs of significant frequency (>5%) were documented. In addition,
ten selected 500 kb regions were fully sequenced from 48 of the samples. Phase II will extend the
analysis of the samples to determine an additional 4.6 million SNPs from the same individuals.
The work of the International HapMap Consortium, together with other studies, shows that:
• Most of the variations appear in all populations sampled. Some of the inter-population differences
reflect different relative amounts of the same SNPs.
• A very few SNPs are unique to particular populations. For example, out of over 1 million SNPs,
only 11 are consistently different – in the sample studied – between all individuals of European origin
and all individuals of Chinese or Japanese origin.
• The genomes of individuals from Japan and China are very similar, suggesting more recent common
ancestry than other population pairs in the study.
• The X chromosome varies more between different populations than other chromosomes. This may
arise from the fact that males contain only one X chromosome, the genes on which are, therefore,
more subject to selective pressure. Recombinations of X chromosomes can occur, but only in females.
• Lengths of haplotype blocks vary among the different sources of samples. They tend to be shorter
among populations from Africa, consistent with the idea of an African origin of the human species.
The idea is that the older the population – more accurately, the larger the number of generations –
the greater the chance of recombination.
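The scale figures quoted in this section (a genome of about 3 × 10⁹ bases, haplotype blocks of roughly 100 kb, and a target of one SNP every 5 kb) can be checked with simple arithmetic. The sketch below uses only those three numbers from the text; the function and constant names are hypothetical.

```python
GENOME_SIZE_BP = 3_000_000_000   # approx. human genome size quoted in the text
HAPLOTYPE_BLOCK_BP = 100_000     # "perhaps ~100 kb" per haplotype block
SNP_SPACING_BP = 5_000           # HapMap Phase I goal: one SNP every 5 kb


def interval_count(genome_bp: int, interval_bp: int) -> int:
    """Number of equal-sized intervals needed to tile the genome."""
    return genome_bp // interval_bp


if __name__ == "__main__":
    blocks = interval_count(GENOME_SIZE_BP, HAPLOTYPE_BLOCK_BP)
    min_snps = interval_count(GENOME_SIZE_BP, SNP_SPACING_BP)
    print(f"haplotype blocks of 100 kb: {blocks}")
    print(f"SNPs needed at 5 kb spacing: {min_snps}")
```

Under these round numbers, localizing a site to one block narrows the search from 3 × 10⁹ positions to one of about 30 000 blocks, and the 5 kb spacing target corresponds to at least 600 000 SNPs, which is consistent with the "over 1 million" reported for Phase I.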
[Figure 2.8 diagram: haplotype network with nodes M01–M24; inset map of Morocco, Algeria, and Gibraltar showing
Marrakech, High Atlas, Akfadou, and Djurdjura; scale bar 0–200 km]
Figure 2.8 Haplotype network from 428-bp segments of mitochondrial DNA collected from free-living Barbary macaques (Macaca
sylvanus) inhabiting Gibraltar, Algeria, and Morocco. The size of each coloured circle is proportional to the number of individuals bearing
each haplotype. Each line segment represents a single mutation. Thus, the sequences represented by M18 and M19, just to the right of
the inset map, differ by three mutations. The colours of the circles in the graph indicate locations of sample collection (see inset map).
Copyright National Academy of Sciences, reprinted by permission.
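The rule that each line segment in such a network stands for one mutation can be illustrated by counting the positions at which two aligned sequences differ. This is only a sketch: the two ten-base sequences are invented for illustration (the real segments are 428 bp and are not given here), and a direct count equals the number of network steps only when no site has mutated more than once along the path.

```python
def mutation_steps(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Hypothetical haplotypes, invented for this example.
    hap_x = "ACGTACGTAC"
    hap_y = "ACGAACGTTT"
    print(mutation_steps(hap_x, hap_y))
```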
[Figure 2.9 image; label: β2m]
Figure 2.9 The immunological distinction between ‘self’ and ‘non-self’ resides in the proteins of the major
histocompatibility complex (MHC) and their interaction with T-cell receptors. This picture shows a human
T-cell receptor in complex with a class I MHC protein and a viral peptide. MHC proteins have broad
specificity, each binding many peptides including those of self and non-self origins. Cell surfaces contain
large numbers of MHC–peptide complexes, among which those binding foreign peptides are a small minority.
T-cell receptors, in contrast, have narrow specificity, and pick out the complexes containing foreign
peptides, like a professional antiques dealer spotting a valuable item in a rummage sale.

Mutations and disease
Many mutations, even if they are not synonymous mutations, are consistent with a healthy life, and typical
life-span, provided that the individual practises a reasonable lifestyle. Such mutations contribute to our
ethnic and individual variety. Loss of some proteins is surprisingly innocuous. Mice lacking myoglobin
thrive, and even show athletic performance comparable to normal mice.
In some cases, species-wide loss of biosynthetic enzymes is not generally considered a disease,
but contributes to the list of essential nutrients. For instance, whereas most animals can synthesize
vitamin C, we must provide it in our diet. If not, the result is the disease scurvy.
Nevertheless, evolution has largely optimized proteins for their roles in healthy organisms. Therefore,
most mutations causing amino acid sequence changes are deleterious, impairing protein function and
threatening to produce disease.
Mutations associated with human disease are collected by the organization Online Mendelian Inheritance
in Man (OMIM)™ (http://www.ncbi.nlm.nih.gov/omim). A corresponding site, Online Mendelian
Inheritance in Animals (OMIA) (http://www.ncbi.nlm.nih.gov/omia), collects mutations associated with
diseases in animals, other than human and mouse.

By what mechanisms can mutations affect human health?
Mutations causing defects in some proteins can be accommodated by adjustments in lifestyle. Other
mutations cause disease only in combination with unusual features of lifestyle or specific triggering
events. We have already mentioned the Z-mutation of α1-antitrypsin, whose tendency to cause emphysema
is enhanced by smoking.
Loss-of-function mutations are often recessive, so that homozygosity for the mutant allele typically
has more severe consequences than heterozygosity. Every individual is heterozygous for some deleterious
mutations that, if homozygous, would be lethal.
Many diseases are associated with the formation of insoluble aggregates, usually of misfolded proteins.
These include classical amyloidoses, Alzheimer’s and Huntington’s disease, aggregates of misfolded serpins,
and prion diseases. Polymerization of insulin creates problems in production, storage, and delivery in
diabetes therapy. Mutations that destabilize proteins can increase the proportion of misfolded proteins, which
show a greater tendency to form aggregates.
Some mutations that produce defective proteins show complex interactions with other traits, with the
result that some deleterious mutations have not been eliminated from populations by selection, because they
carry some compensating advantages. For example, the genes for sickle-cell anaemia and for
glucose-6-phosphate dehydrogenase deficiency confer resistance to malaria (see Box 2.5).
Dysfunction of a regulatory protein or receptor can disorganize the operation of a pathway even if all
components of the pathway regulated are normal. Some abnormal regulatory proteins cannot be activated
at all, whereas others are constitutively activated and cannot be shut off. The effects include:
Physiological defects: a number of diseases are associated with mutations in G protein-coupled receptors.
Some mutations in opsins are associated with colour blindness. Certain mutations in the common
G protein target of olfactory receptors lead to loss of sense of smell.
Developmental defects: several types are traceable to mutations in hormone receptors. For instance,
Laron syndrome, a phenotype including diminished stature, arises from a mutation in the human growth
hormone receptor. Administration of exogenous growth hormone does not restore normal growth.

• Many diseases are caused directly by mutations; many others have a genetic component and arise in the
context of an interaction between genetics and lifestyle.
metabolically impoverished, and have no alternative mechanism for detoxifying H₂O₂. Without active G6PDH,
build-up of H₂O₂ will oxidize and denature haemoglobin, leading to destruction of red blood cells, producing
a condition called haemolytic anaemia.
Eating fava beans, especially if uncooked, can induce anaemic episodes in people deficient in G6PDH. The
danger of eating fava beans has been recognized since antiquity, and has been associated with food taboos, and
preparation techniques designed to reduce toxicity. Pythagoras, for example, banned eating of fava beans
in his school. We now know that fava beans contain the compounds vicine and convicine, metabolized in the
intestine to isouramil and divicine, which react with oxygen to produce hydrogen peroxide, subjecting cells
to oxidative stress.
Other chemicals, including certain drugs, present the same danger to G6PDH-deficient people. During World
War I, some patients were observed to suffer dangerous side effects from the antimalarial drug primaquine. Many
drugs, including sulphonamides, are now contraindicated for G6PDH-deficient patients, as is the taking of large
doses of vitamin C. The observation of variations in effectiveness and toxicity of different drugs in different
people has developed into the new field of pharmacogenomics, the tailoring of drug treatments to the genotype
of the individual patient.
Why have dysfunctional G6PDH genes remained at such a high level in the population? Why does primaquine
produce haemolytic anaemia in G6PDH-deficient patients, and does this have a relationship to its antimalarial
activity? And why have fava beans continued to be grown if non-toxic alternatives are available?
The malarial parasite invades the red blood cell of its host, and competes metabolically with normal activity.
Primaquine and related drugs, such as chloroquine, subject the red blood cells to oxidative stress. Cells stressed
by both parasite and drug are the most vulnerable, and if they die they take the parasite down with them. Because
consumption of fava beans subjects cells to oxidative stress, they also provide an antimalarial effect, recognized
in folk medicine. Indeed, fava beans have some effect against malaria even for people with normal G6PDH
activity; those with abnormal G6PDH have a greater advantage, until the maturing Plasmodium produces its own G6PDH.
The link with malaria is the likely explanation of the persistence of the gene in the population and the fava bean
in agriculture. A final clue appears in the calendar: there is good overlap between the fava bean harvest period
and the peak Anopheles breeding season.
Haemoglobinopathies – molecular diseases caused by abnormal haemoglobins

Sickle-cell anaemia: Pauling and co-workers showed in 1949 that haemoglobin isolated from patients
with sickle-cell anaemia differed in electric charge from normal haemoglobin. Patterns of inheritance
showed that sickle-cell anaemia is a genetic disease. Pauling’s discovery was therefore the first evidence
that genes precisely control the structures of proteins. This preceded the first determination of the amino
acid sequence of a protein.
Recall that haemoglobin contains four polypeptide chains, two α chains and two β chains (Figure 1.16).
The sickle-cell mutation changes residue 6 of the β chain from a charged sidechain, glutamic acid, to a
nonpolar one, valine (β6Glu→Val).
This creates a ‘sticky patch’ on the surface of the molecule. As a result, the mutant haemoglobin forms
polymers within the erythrocyte, in the unligated or deoxy state (in the deoxy form, typical of venous blood,
haemoglobin is not binding oxygen). To flow through small capillaries, erythrocytes must be deformable, as
their typical size, 7.8 μm (in humans), is larger than the diameter of small capillaries. The formation of the
polymers has a rigidifying effect on the erythrocyte, impeding its flow and blocking capillaries (Figure 2.10).
In the traffic jam building up behind a plugged capillary, arriving red cells release their oxygen to
surrounding hypoxic tissues, become deoxygenated, and thereby aggravate the problem. This produces pain
– because of reduced oxygen supply resulting from capillary congestion – and anaemia and jaundice,
consequences of the rapid breakdown of red blood cells.
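The β6Glu→Val substitution corresponds to a change in a single base of one codon; the text cites gag→gtg for the most common sickle-cell SNP. A minimal sketch of that correspondence follows, using the standard genetic code written in the conventional TCAG ordering. The function names are hypothetical, and one-letter amino acid codes are used (E for glutamic acid, V for valine).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG x TCAG x TCAG order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon: str) -> str:
    """Return the one-letter amino acid code ('*' for stop) of a DNA codon."""
    return CODON_TABLE[codon.upper()]


def changed_positions(codon_a: str, codon_b: str) -> list[int]:
    """Return 1-based positions at which two codons differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(codon_a.upper(), codon_b.upper())) if a != b]


if __name__ == "__main__":
    normal, sickle = "gag", "gtg"
    print(translate_codon(normal), "->", translate_codon(sickle))
    print("changed codon positions:", changed_positions(normal, sickle))
```

Only the middle base differs, yet the encoded residue changes from a charged to a nonpolar sidechain.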
Figure 2.10 Coloured scanning electron micrograph of normal (round) and sickle-shaped erythrocytes. The shape
of the cells is caused by aggregation of mutant haemoglobin in its deoxy state. The symptoms of the disease are
caused not so much by the shape of the cells but by their inability to distort their shape in order to pass
through the narrow capillaries.
(Science Photo Library).

The most common SNP associated with sickle-cell anaemia changes the codon gag to gtg. It is possible
to test specifically for this mutation. Alternatively, sequencing the entire region would pick up other
possible mutations that might lead to thalassaemias.
Thalassaemias are genetic diseases associated with defective or deleted haemoglobin genes: most Caucasians
have four genes for the α chain of normal adult haemoglobin, two alleles of each of the two tandem
genes α1 and α2. Therefore α-thalassaemias can present clinically in different degrees of severity, depending
on how many genes encode normal α chains. Only deletions leaving fewer than two active genes present
as symptomatic under normal conditions. Observed genetic defects include deletion of both genes (a process
made more likely by the tandem gene arrangement and sequence repetition, which make crossing
over more likely) and loss of chain termination leading to transcriptional ‘read through’, creating
extended polypeptide chains, which are unstable.
β-thalassaemias are usually point mutations. These may be:

Phenylketonuria
Phenylketonuria (PKU) is a genetic disease caused by deficiency in a metabolic enzyme, phenylalanine
hydroxylase, the enzyme that converts phenylalanine to tyrosine (see Figure 2.11). Without treatment,
phenylalanine accumulates in the blood to toxic levels, and these high levels cause a variety of
developmental defects, including mental retardation, microcephaly, and seizures. The disease
cannot be cured, but symptoms can be avoided by lifestyle control: a phenylalanine-free diet. Screening
of newborns for PKU is legally required in the United States and many other countries.
PKU is an example of how understanding the molecular mechanism of a disease can, for some conditions,
help restore and preserve health through suggested changes in lifestyle and/or medical treatment.
PKU is an autosomal recessive trait, associated with mutations in the phenylalanine hydroxylase
gene on the long arm of chromosome 12 (12q22–12q24.1). In the UK and USA the prevalence is
about 1 in 10 000 individuals; about 1 in 50 is a carrier. PKU is the subject of neonatal screening in
many countries.
A large number of known mutations are associated with PKU, appearing in all 13 exons of the gene and
in flanking sequences (http://www.pahdb.mcgill.ca/). Many but not all of them are SNPs. These include
many that affect catalytic activity, somewhat fewer that affect regulation, and even fewer that affect the
assembly of the tetrameric enzyme.
The classical test for PKU depended on the build-up of phenylalanine and its degradation products
such as phenylpyruvate in neonatal blood and urine.
Genetic diseases – some examples of their causes and treatment 61
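The ApoE allele definitions and risk correlations listed just below amount to a small lookup table keyed on the bases at two SNPs, rs429358 and rs7412. A minimal sketch of that lookup follows. The function names are hypothetical, and the sketch assumes the two bases are known to lie on the same chromosome copy (phased data), which the listing itself does not discuss.

```python
# (base at rs429358, base at rs7412) -> ApoE allele, as listed in the text.
APOE_ALLELES = {
    ("C", "T"): "E1",
    ("T", "T"): "E2",
    ("T", "C"): "E3",
    ("C", "C"): "E4",
}


def apoe_allele(rs429358: str, rs7412: str) -> str:
    """Name the ApoE allele carried on one chromosome copy."""
    return APOE_ALLELES[(rs429358.upper(), rs7412.upper())]


def alzheimer_risk_notes(allele_1: str, allele_2: str) -> list[str]:
    """Apply the two correlations from the text to a pair of alleles."""
    alleles = {allele_1, allele_2}
    notes = []
    if "E4" in alleles:
        notes.append("increased")
    if "E2" in alleles:
        notes.append("decreased")
    return notes  # empty list: neither correlation applies


if __name__ == "__main__":
    copy_a = apoe_allele("T", "C")
    copy_b = apoe_allele("C", "C")
    print(copy_a, copy_b, alzheimer_risk_notes(copy_a, copy_b))
```

A person carrying one E2 and one E4 allele matches both rules; the sketch simply reports both rather than guessing how they combine.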
ApoE1 = rs429358(C) + rs7412(T) [minor variant]
ApoE2 = rs429358(T) + rs7412(T)
ApoE3 = rs429358(T) + rs7412(C) [∼55%]
ApoE4 = rs429358(C) + rs7412(C)

The correlations with risk of Alzheimer’s disease are:
At least one E4 allele → increased risk of Alzheimer’s
At least one E2 allele → decreased risk of Alzheimer’s

SNPs and cancer
SNPs are relevant to cancer research and treatment in several ways:
(1) Mutations detectable in the genome indicate propensity for development of cancers. Mutations in
BRCA1 and BRCA2, as indicators for likelihood of breast and ovarian cancer development, are
probably the best known.
(2) Sequence analysis can predict disease progression and outcome.
(3) Sequence analysis can help choose optimal treatment.
(4) Tumour progression often involves mutations and divergence of cell lines.
The onset of cancer is associated with loss of genome integrity. Cancer results from accumulated
mutations that break down the controls on cell growth. The source can be in three classes of genes:
genes that regulate cell proliferation, genes required for repair of DNA damage, and genes that control
apoptosis.
The ‘two-hit hypothesis’ relates the difference between sporadic and familial forms of disease to the
need to mutate both copies of such genes.
Consider, as an example, retinoblastoma, a rare childhood tumour of the eye. Approximately 30–40%
of cases are familial; the rest sporadic. The familial form shows an autosomal dominant inheritance
pattern. Clinical characteristics of familial retinoblastoma that distinguish it from the sporadic picture
are early onset, and the appearance of multiple tumours, affecting both eyes.
The two-hit hypothesis offers an explanation for the differential age of onset, and severity, of familial
and sporadic retinoblastoma. The idea is that non-familial cases require inactivation of both copies of
the retinoblastoma gene, each of which was originally functional. Separate and independent mutations are
necessary. In contrast, familial retinoblastoma affects a person who has inherited one defective and one
functional copy of the gene. That is, the first hit is inherited; all that is needed is the second hit (see
Figure 2.12).
Tumour suppressor genes protect cells against development of cancer. They encode proteins that
inhibit tumour formation. Their normal function can
[Figure 2.12 diagram; labels: Rb (gene copies), second hit, Tumour]
Figure 2.12 Explanation of the difference between sporadic and familial retinoblastoma, a rare cancer of the retina. According to the
two-hit theory, two copies of a gene must be inactivated. In sporadic retinoblastoma, both copies are originally functional, and two
separate, independent mutations are required to inactivate them. In familial retinoblastoma, one defective copy of the gene is inherited.
Only a single mutation, in the other allele, is required to produce the disease.
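The contrast drawn in Figure 2.12 can be put in rough numerical terms. The sketch below is a deliberately simplified model, not taken from the text: it treats each hit as an independent event with the same probability per gene copy per cell, and both the cell count and the hit probability are hypothetical values chosen only to show the scaling.

```python
import math


def expected_transformed_cells(n_cells: int, hit_probability: float, hits_required: int) -> float:
    """Expected number of cells in which all required independent hits have occurred."""
    return n_cells * hit_probability ** hits_required


if __name__ == "__main__":
    N_CELLS = 1_000_000      # hypothetical number of susceptible retinal cells
    HIT_PROBABILITY = 1e-6   # hypothetical chance of inactivating one gene copy in one cell

    sporadic = expected_transformed_cells(N_CELLS, HIT_PROBABILITY, hits_required=2)
    familial = expected_transformed_cells(N_CELLS, HIT_PROBABILITY, hits_required=1)
    print(f"sporadic (two hits needed): {sporadic:.2e}")
    print(f"familial (one hit needed):  {familial:.2e}")
    print("ratio close to 1/p:", math.isclose(familial / sporadic, 1 / HIT_PROBABILITY))
```

In this toy model the familial case is more likely than the sporadic case by a factor of 1/p, which is in line with the earlier onset and multiple tumours described for familial retinoblastoma.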
be to inhibit cell growth; mutations ‘take the foot off the cell-growth brake’. Mutants in these genes raise
the risk of developing cancer.
Well-known examples of tumour suppressor genes are: BRCA1 and BRCA2. In the general population:
∼12% of women will develop breast cancer;
∼1.4% of women will develop ovarian cancer.
Of women with a harmful mutation in BRCA1 or BRCA2:
∼60% will develop breast cancer;
∼15–40% will develop ovarian cancer.
BRCA1 and BRCA2 encode long proteins unrelated in sequence and structure (see Box 2.6). Both
proteins are required for chromosome stability, participating in mechanisms of repair of DNA
double-strand breaks.
Many BRCA1 mutations are known. (Many are not SNPs.) Their prevalence varies among populations,
showing strong founder effects (see Table 2.1). Testing for mutations in these genes is now quite
common (see Box 2.7).

Table 2.1 Common BRCA1 and BRCA2 mutations
Population        Common BRCA1 mutations                               Common BRCA2 mutations
Ashkenazi Jews    185delAG, 5382insC                                   6174delT
Iceland           –                                                    999del5
Denmark           2594delC, 5208T→C                                    –
Lithuania         4153delA, 5382insC, 61G→C                            –
China             589delCT, IVS7–27del10, 1081delG, 2371–2372delTG     3337C→T
BOX 2.7 Genetic testing for mutations in breast cancer genes BRCA1 and BRCA2
Breast cancer is a leading killer of women in western Europe and the USA, affecting over 13% of the population
(100 times as many women as men) and causing death in about 3% of all women. Among the known genetic factors
that raise the risk of breast cancer are mutations in the genes BRCA1 and BRCA2.
Screening for mutations in these genes can provide a risk-alerting system for breast and ovarian cancer.
However, planning of population-wide screening programmes must take into consideration cost/benefit analysis.
In many countries, medical care delivery policy is set by a national health service. (The USA is a notable exception.)
Governments must decide how to apportion resources, taking into account the cost of any procedure and the
utility of the information it produces. Mutations in BRCA1 and BRCA2 are associated with only about 5–10% of breast
cancers. Conversely, not all women with a mutation in either of these genes develop cancer, although over 50%
do. The BRCA1 and BRCA2 genes are long, multiexon sequences each >100 kb long, making total resequencing
of the genes a complicated procedure. Many mutations are known, spaced widely within the exons. Given that even
a complete resequencing of the genes would not provide a helpful prognosis in many cases, it is not deemed useful,
with current technology, to screen the entire population fully for BRCA1 and BRCA2 mutations.
The logic of the decision changes for individuals suspected to be at high risk. These include women who:
• have a close relative known to have a BRCA1 or BRCA2 mutation;
• have close relatives who have been diagnosed with early-onset (age <50 years) breast or ovarian cancer; or
• have themselves been diagnosed with breast cancer and want to know their likelihood of developing ovarian
cancer.
For these individuals, the likelihood of finding useful information certainly justifies genetic testing.
In some genetically relatively homogeneous populations, practical advantage can be taken of the observation that
specific mutations in BRCA1 and/or BRCA2 are common in that population. For instance, Ashkenazi Jews are about 20
times more likely to bear a mutant in one of the two genes than the general population. Three mutations – 185delAG
and 5382insC in BRCA1 and 6174delT in BRCA2 – account for 90% of inherited breast and ovarian cancers in this
group. Testing for these three mutations or for any specific mutation known in a relative – whether one of these three
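[Figure 2.13 diagram. Steps: collect sample; isolate mRNA; PCR amplification; in vitro transcription and translation; immunoprecipitation and gel electrophoresis. Wild-type mRNA: AUG … CAA … Stop … AAAA. Mutant mRNA: AUG … UAA (new stop) … Stop … AAAA. Peptides W1, W2, W3 (wild type) and M1, M2, M3 (mutant); gel lanes in the order W1 M1 W2 M2 W3 M3.]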
Figure 2.13 Schematic explanation of the protein truncation test. Starting from mRNA, or in some cases from genomic DNA,
a series of fragments covering the gene or exon of interest is amplified. In vitro transcription and translation produces the
corresponding polypeptide chains. A mutation that introduces an internal stop codon or a deletion leading to a frame shift
produces some shortened fragments that can be distinguished on a gel.
Straight lines indicate nucleic acids; wavy lines indicate polypeptides. W1, W2, and W3 are peptides derived from the wild-type
gene. M1, M2, and M3 are derived from the mutant gene, using the same primers. The red and blue dots in the peptides
correspond to the position of the CAA in the wild-type mRNA and the UAA in the mutant mRNA. In this example, the peptides
W1 and M1 have the same size, but M2 is shorter than W2 and M3 is shorter than W3. Note that the mobilities of the peptides
are not strictly proportional to their molecular weights.
The triplets in parentheses refer to the original mRNA sequence. Different bases appear in the molecules that are created during
the procedure. (See Exercises 2.6–2.9.)
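The logic behind the protein truncation test can be illustrated with a minimal Python sketch. It models only the translation step: a single C→U change turns a CAA codon into a UAA stop codon, and the resulting peptide is shorter. The two toy transcripts are hypothetical and are not BRCA1 or BRCA2 sequences; the names translate, wild_type, and mutant are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, built in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to (not including) the first in-frame stop."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the ribosome releases the chain
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Hypothetical toy transcripts: the mutant differs by one C->U change (CAA -> UAA).
wild_type = "AUG" + "GCU" * 3 + "CAA" + "GGU" * 4 + "UAA"
mutant = wild_type.replace("CAA", "UAA", 1)

wt_peptide = translate(wild_type)
mut_peptide = translate(mutant)
print(len(wt_peptide), wt_peptide)
print(len(mut_peptide), mut_peptide)
```

The two transcripts have the same length, but the peptides do not, which is why the test compares translation products rather than nucleic acids.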
or any other – is much simpler and more cost-effective than a full resequencing of the genes. Resequencing all of the exons of the gene is at present about ten times more expensive and can always be considered as a subsequent step if none of the specific mutations appear. This is likely to change, as the cost of sequencing diminishes.
What techniques are used to screen for common mutations?
• The protein truncation test detects premature stop codons by amplifying the coding region of exon 11 of BRCA1 or exons 11 and 12 of BRCA2 (see Figure 2.13).
• The single-stranded conformational polymorphism test is applied to analysis of exons 2–10 and 12–24 of BRCA1 and exons 2–10 and 12–27 of BRCA2 (see Figure 2.14).
• Full resequencing of the gene.
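[Figure 2.14 diagram. Steps: PCR; heteroduplexing; conformation-sensitive gel electrophoresis. The two alleles differ at one base pair (C·G in one, T·A in the other); after heteroduplexing the strand pairings shown are G/C, G/T, A/C, and A/T.]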
Figure 2.14 Conformation-sensitive gel electrophoresis, a method for detecting localized mutations. This figure shows the
procedure for analysing a sample from a heterozygote, containing normal DNA from one chromosome (red) and a mutation
on the other (blue). The region surrounding the site of suspected mutation is amplified by PCR. Melting and annealing produces
the two original double-stranded regions and also two mismatched pairs. The regions are sufficiently similar to form a hybrid pair
despite the single-base mismatch, but the mismatch causes sufficient conformational deformation to alter mobility on the gel,
especially under partially denaturing conditions. A normal sample would produce only a single band (left lane on the gel), but the
mixture of normal and mutant would produce additional bands, the mismatched pairs showing altered mobility.
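The strand bookkeeping in heteroduplex formation can be sketched in a few lines of Python. The sketch enumerates every pairing of a top strand with a bottom strand after melting and re-annealing, and counts the mismatched positions in each duplex. The two 8-base sequences are hypothetical, and the function names are illustrative assumptions; the code says nothing about gel mobility itself.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def duplexes(normal, mutant):
    """List every top/bottom strand pairing after melting and re-annealing.

    Each entry is (top strand source, bottom strand source, mismatch count).
    """
    tops = {"normal": normal, "mutant": mutant}
    bottoms = {name: reverse_complement(seq) for name, seq in tops.items()}
    pairings = []
    for (top_name, top), (bottom_name, bottom) in product(tops.items(), bottoms.items()):
        # A position is mismatched when the top base is not the Watson-Crick
        # partner of the base opposite it on the bottom strand.
        partner = reverse_complement(bottom)
        mismatches = sum(1 for a, b in zip(top, partner) if a != b)
        pairings.append((top_name, bottom_name, mismatches))
    return pairings


# Hypothetical heterozygote: the two alleles differ by a single C->T substitution.
for top_name, bottom_name, mismatches in duplexes("GATCCGTA", "GATCTGTA"):
    kind = "homoduplex" if mismatches == 0 else "heteroduplex"
    print(top_name, bottom_name, mismatches, kind)
```

A homozygous sample gives only perfectly matched duplexes, whereas the heterozygote gives two matched and two mismatched duplexes, corresponding to the extra bands described in the caption.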
Species
The next step up from populations is species. Species are a fundamental unit of evolution. Species represent nature’s experiments in structures and lifestyles. It is species that were the elements of Linnaeus’s taxonomy, and it was the emergence of novel species that gave Darwin his life’s work and the title of his major book.
It is extremely rare in science that a concept so fundamental to a field as species is in biology would be so difficult to define. Genomics has greatly illuminated but not solved the problem.
To understand what we mean by species we need not only an explanation of the concept, but a criterion for deciding whether two organisms – or, more appropriately, two populations – belong to the same species or different ones. Ernst Mayr enunciated the classic biological approach, focusing on the idea that different species are reproductively isolated in nature. Mayr’s definition applies to sexually reproducing organisms: individuals that can mate with each other to produce viable and fertile offspring when they encounter each other in nature are members of the same species. When reproductive barriers arise – for instance, when groups of animals are trapped on different islands by rising sea levels – the species divides into two or more populations that, separately, interbreed only within themselves and maintain two gene pools. The separated populations may pursue different evolutionary paths and ultimately diverge to form separate species. Of course, geographic isolation is only one possible cause of speciation.
Many other definitions have been proposed, including those based on comparisons of features of phenotypes, and divergence of genomic sequences. For higher organisms at least, the different definitions are usually consistent, but they are not entirely equivalent.
• Many modern biologists define a species as a group of similar organisms that interbreed naturally to produce fertile offspring. An alternative approach is to base species definitions on genomic sequences. This approach is extremely powerful even for higher organisms, and essential for prokaryotes.
Any valid definition must take into account the fundamental observation that species of higher organisms are discrete entities. The taxonomic hierarchy is in principle a classification of species. We now recognize that the static classification is largely congruent with the evolutionary tree of ancestor–descendant relationships.
• Why living things should be ‘quantized’ into discrete species is a very subtle question.
A fundamental paradox about the biology of higher organisms is that although species are discrete, new species can evolve. Darwin’s finches are a classic example of reproductive isolation of populations that have diverged into separate species. Resolution of this paradox is quite a difficult challenge, but perhaps a basic component of the answer would be that it is the discreteness of ecological niches – to which individual species are optimized – that accounts for the discreteness of species of higher organisms.
For prokaryotes, in contrast, the species concept is in serious trouble. In particular the presence of large amounts of horizontal gene transfer overturns the idea that there is a hierarchy of ancestor–descendant relationships.
Few people doubt that if the problem of understanding species can be resolved, genomics will play a crucial role. Advantages of genomic approaches over others are that they allow:
(1) more accurate assignment of relationships (as we saw in the case of Buddenbrockia plumatellae (see p. 48)),
(2) quantitative measurements of the divergences between species, and
(3) estimation of the time elapsed since the last common ancestor. (Now, classical palaeontology also allows measurements of dates of origins of species that have left fossil records. In fact from fossil records we can date extinctions, which is not possible from genomics!)
• Here we have described both the importance of the species concept and some of the problems associated with it. Both the concept and the problems will be themes of much of the rest of this book.
We simply do not know the number of species that have ever existed. Even for the species alive today, only a fraction have been formally described. About 1.5 million species are known to science (see below), but estimates suggest that at least twice as many exist, and probably even an order of magnitude more than that.
Most palaeontologists agree that living species amount to less than 1% of the number that have ever lived, but it is impossible to infer precise estimates.
Group of organisms    Number of species described
Insects               751 000
Other animals         281 000
Higher plants         248 400
Fungi                 69 000
Protozoa              30 800
Algae                 26 900
Prokaryotes           4 800
Viruses               1 000
Over half of the described species are insects. Almost 40% of the insects are beetles (order Coleoptera). When J.B.S. Haldane was asked what he had learned about God from his study of biology, he replied that God must have ‘. . . an inordinate fondness for beetles.’
Given the large numbers of even fairly closely related species, one might be tempted to think that the species that lack scientific description are only minor variations on familiar themes.
This is not true.
The Burgess Shale, in the Canadian Rockies of British Columbia, Canada, near Banff, contains some of the world’s best-preserved fossils from the Cambrian period, ∼530 million years ago. The locality contained a rich, predominantly invertebrate, community that left fossils in an unusually high-quality state of preservation. Some of the forms are similar to known animals, both extant and extinct, but others show profound structural differences, including completely different body architectures. The loss of these species has impoverished the natural world in general and the science of biology in particular.
These fossils strongly suggest that the living forms we see are, to a large extent, the result of historical accident and that the course of evolution might easily have been very different.
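The proportions quoted in this box can be checked directly against the table of described species. The short Python sketch below uses only the counts given in the table; the variable names are illustrative, and the beetle figure is an upper bound derived from the 'almost 40%' statement rather than a count from the source.

```python
# Numbers of described species, as listed in the table in this box.
described = {
    "Insects": 751_000,
    "Other animals": 281_000,
    "Higher plants": 248_400,
    "Fungi": 69_000,
    "Protozoa": 30_800,
    "Algae": 26_900,
    "Prokaryotes": 4_800,
    "Viruses": 1_000,
}

total = sum(described.values())
insect_share = described["Insects"] / total

# 'Almost 40% of the insects are beetles' gives an upper bound, not a count.
beetles_upper_bound = 0.40 * described["Insects"]

print(f"Total described species: {total:,}")
print(f"Insects as a share of the total: {insect_share:.1%}")
print(f"Beetles, at most about: {beetles_upper_bound:,.0f}")
```

The total comes to roughly 1.4 million, in line with the 'about 1.5 million' quoted in the text, and insects account for just over half of it.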
The biosphere
The ecologist G.E. Hutchinson wrote a book entitled The Ecological Theater and the Evolutionary Play. Life on Earth has been a 3.5 billion year epic of exploration and adventure. It has included comedy and tragedy. Hutchinson emphasized the interdependence of environment and living things, which reciprocally affect each other. (We have already mentioned the example of the photosynthetic origin of atmospheric oxygen.)
We shall never know the full extent of the species that Nature has generated in its explorations (see Box 2.8). Organisms alive today represent only a fraction of all that have existed.
Extinctions
We now recognize that the roster of extant species is in flux. Although Darwin was the first to argue convincingly that new species arise, the idea of extinction was propounded by Cuvier in 1796. Nevertheless, the idea of the immutability of species retained strong adherents, a problem Darwin faced later. When US President Thomas Jefferson sent Meriwether Lewis and William Clark to explore the North American continent, he urged them to be on the lookout for mammoths. Anyone could have hoped that mammoths, extinct in Eurasia, might have survived in the New World; the idea of a fixed roster of species shaded hope into confidence. Cuvier believed that extinction of species was the result of major natural catastrophes. Indeed, sometimes, natural catastrophes do cause large-scale simultaneous extinctions (see Figure 2.15 and Box 2.9). An example is the asteroid that landed in what is now the Yucatan Peninsula of Mexico, approximately 65 million years ago, at the boundary between the Cretaceous and Tertiary periods. However, in other cases, species reach a more peaceful end of their existence – ‘dying peacefully in their beds’ so to speak – as a result of gradual environmental change.
Figure 2.15 Geological ages (e.g. Cenozoic), epochs (e.g. Quaternary), and cataclysmic events (e.g. asteroid impact: mass extinction) in black. First appearance of, or prevalence of, different life forms in red. mya = millions of years ago.
A major mass extinction is going on right now and we are largely, although not entirely, to blame. Human activity is responsible for the disappearance of species through excessive hunting, habitat destruction, transport of species that replace native ones, and the introduction of toxic chemicals into the environment.
Moreover, extinction of one species may cause the loss of others. For instance, some plants are dependent on a particular insect for pollination. Loss of the insect is fatal to the plant, and the network of dependence may be quite far ranging. North American bees are being killed in great numbers, by the Varroa mite, a parasite that has developed resistance to the pesticides that formerly controlled it, and by the mysterious colony collapse disorder, which may be at least partly viral in origin. Many Californian almond growers must now import honeybees to ensure the pollination of their trees. (The problem is not restricted to almonds: about one-third of our food depends on honeybee pollination.)
Extinction of the thylacine (Thylacinus cynocephalus)
It is a sad event when the last individual of a species dies. The last thylacine, or Tasmanian wolf, named Benjamin, died in the Hobart Zoo on 7 September 1936 (see Figure 2.16). (On the same day, Boulder Dam, now Hoover Dam, started operation. Straddling the Colorado River, it provides water and electrical power to the south-western USA, including the rich agricultural lands of Southern California and the city of Los Angeles. The opening of the Hoover Dam
● RECOMMENDED READING
• Cancer genomics:
Ding, L., Wendl, M.C., Koboldt, D.C., & Mardis, E.R. (2010). Analysis of next-generation
genomic data in cancer: accomplishments and challenges. Hum. Mol. Genet. 19,
R188–R196.
Chin, L., Hahn, W.C., Getz, G., & Meyerson, M. (2011). Making sense of cancer genomic data.
Genes & Dev. 25, 534–555.
Majewski, I.J. & Bernards, R. (2011). Taming the dragon: genomic biomarkers to individualize
the treatment of cancer. Nat. Med. 17, 304–312.
Stratton, M.R. (2011). Exploring the genomes of cancer cells: progress and promise. Science
331, 1553–1558.
• A fascinating general essay about evolutionary biology:
Hutchinson, G.E. (1965). The Ecological Theater and the Evolutionary Play. Yale University
Press, New Haven, CT, USA.
• Discussion of the problem of biological species. Papers from a colloquium in honour of 100-year
old Ernst Mayr.
Hey, J., Fitch, W.M., & Ayala, F.J. (2005). Systematics and the origin of species: An introduction.
Proc. Natl. Acad. Sci. USA 102, 6515–6519.
de Queiroz, K. (2005). Ernst Mayr and the modern concept of species. Proc. Natl. Acad. Sci.
USA 102 (Suppl 1), 6600–6607.
• Current and past extinctions:
May, R.M. (2010). Ecological science and tomorrow’s world. Philos. Trans. Roy. Soc. Lond. B:
Biol. Sci. 365, 41–47.
Barnosky, A.D., Matzke, N., Tomiya, S., et al. (2011). Has the Earth’s sixth mass extinction
already arrived? Nature 471, 51–57.
• Two celebrated palaeontologists interpret the Burgess Shale fossils:
Gould, S.J. (1990). Wonderful Life: The Burgess Shale and the Nature of History. W.W. Norton,
New York.
Conway Morris, S. (1998). The Crucible of Creation: The Burgess Shale and the Rise of Animals.
Oxford University Press, Oxford.
See also the exchange between Conway Morris and Gould (1998): Showdown on the Burgess
Shale. Nat. Hist. 107, 48–55.
Exercises
Exercise 2.1 A randomly chosen pair of humans will show an average nucleotide diversity
that lies between 1 base pair in 1000 and 1 in 1500. Any human and any chimpanzee differ in
approximately 1 base pair in 100. (Note that the chimpanzee genome is somewhat larger than
the human.) (a) Estimate the number of differences in the total sequences between two randomly
chosen humans. (b) Estimate roughly the number of differences in the total sequences between a
human and a chimpanzee.
Exercise 2.2 Why is it possible to have calico cats but not calico kangaroos (barring abnormal cell
division at an early embryonic stage or a skin disease)?
Exercise 2.3 From material presented in this chapter, give examples of mutations with clinical
consequences that involve: (a) dysfunction of a metabolic enzyme, (b) ‘read-through’ producing
extended proteins, (c) proteins with lower stability than the normal version, (d) increase in the
risk factor for a disease not known to be associated with the primary (?) function of a protein,
(e) enhanced probability of developing cancer. In each case state, if possible, the nature of the
mutation, and the mechanism by which it produces clinical consequences.
Exercise 2.4 Most newborn mammals can digest lactose in infancy, consistent with their
dependence on maternal milk for feeding. The enzyme lactase hydrolyses the disaccharide lactose,
the major carbohydrate component in milk, to glucose + galactose. In most mammalian species,
expression of lactase ceases at the time of weaning. Many human populations follow the
mammalian paradigm: lactase expression is permanently turned off when a child is about 4 years
old. Loss of lactase expression produces subsequent intolerance to dairy products. A single-site
mutation that causes lifetime lactase expression – or lactase persistence – was selected for in
populations that domesticated cattle and depended on dairy products in their adult diet. In Europe,
lactase persistence is more common in the north, where less exposure to sunlight makes northern
Europeans more dependent on dairy products as sources of calcium and vitamin D precursors.
Ninety-six per cent of Swedes and Finns are lactase persistent.
(a) Would you expect the mutation to be found within the coding regions of the lactase gene
itself?
(b) Heterozygotes produce approximately half the amount of lactase. (This is sufficient to avoid
the symptoms of lactose intolerance.) From this observation, what additional inference can you
make about the location of the mutation?
Exercise 2.5 Suppose that there are ten SNPs in a 10 kb region. If the region is on the Y
chromosome, how many possible haplotypes are there? If the region is on a diploid chromosome,
how many possible haplotypes are there?
Exercise 2.6 In the diagram of the protein truncation test (Figure 2.13), what actual triplets appear
in place of the triplets shown in parentheses?
Exercise 2.7 In the diagram of the protein truncation test (Figure 2.13), which pair of peptides has
the larger difference in length: W2 and M2 or W3 and M3?
Exercise 2.8 In the protein truncation test (see Figure 2.13), in cases in which the novel stop
codon arises from a single-site substitution, as shown in the figure, why would it not work to
subject the amplified cDNA fragments directly to electrophoresis, i.e. to skip the in vitro
transcription and translation step?
Exercise 2.9 What would the gel in Figure 2.13 look like if the person from whom the sample was
taken had the mutation C→G at the same site?
Exercise 2.10 In the conformation-sensitive gel electrophoresis technique for detecting mutations
(Figure 2.14), the gel is run under mildly denaturing conditions to accentuate conformational
differences, thereby increasing the differences in mobility. Why would a gel fail to give useful
information if it were run under fully denaturing conditions?
Exercise 2.11 Why might the prescription of monoamine oxidase inhibitors to a child presenting
with symptoms of anxiety, in the case of suspected maltreatment, be contraindicated?
Exercise 2.12 In Figure 2.8 showing the data on Barbary macaques, where did the animal
corresponding to the isolated orange point come from?
Problems
Problem 2.1 Survival of a SNP. Suppose a person is heterozygous for a novel, selectively neutral
mutation. Suppose the person has two children that survive to reproductive age. The probability
of loss of the mutation in that one generation is 25%. If each descendant has two children that
survive to reproductive age, what is the probability of complete disappearance of the mutation in
200 years? Assume 25 years per generation.
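The 25% figure quoted in Problem 2.1 can be checked by enumeration, assuming that each child independently inherits either the mutant or the normal allele from the heterozygous parent with equal probability. The Python sketch below covers only that single generation and deliberately leaves the multi-generation question to the reader; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Each of the two children receives one of the parent's two alleles,
# independently and with equal probability (an assumption of the model).
outcomes = list(product(["mutant", "normal"], repeat=2))

# The mutation is lost in this generation only if neither child inherits it.
lost = [outcome for outcome in outcomes if "mutant" not in outcome]

p_loss = Fraction(len(lost), len(outcomes))
print(p_loss)  # 1/4
```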
Problem 2.2 Replication of RNA viruses is error-prone. It is estimated that the replication of
HIV-1 introduces one mistake per replication of its ∼10⁵ bp genome. If the estimated generation
time of HIV-1 in a human body is ∼1 day and 10¹⁰ progeny viruses are produced per patient
per day, and assuming that an AIDS patient is initially infected by viruses with identical
genomes, estimate whether the patient will (a) generate a mutation at every possible site
in the genome every day; and (b) generate mutations at every possible pair of sites in the
genome every day.
Problem 2.3 Assume the model of expression switching in the human β-globin region shown in
Figure 2.2(b,c). (a) What developmental progression of globin expression would you expect if
the region containing the Gγ and Aγ genes were interchanged with the region containing the δ
and β genes? (b) What developmental progression would you expect in patients with deletions
in the δ- and β-regions, including the pyrimidine-rich control region YR?
Problem 2.4 (a) In the experiment that produced Cc (Copycat) as a clone of Rainbow (see
Figure 2.3), the X-linked inactivation of the cell selected was not reversed. If a number of other
cats were cloned by the same procedure, from other cells taken from Rainbow, what coat colours
would the cats produced show? (b) Suppose a way were discovered to reverse the X-linked
inactivation in the cell from Rainbow from which another cat were cloned. Would the cloned
cat have a coat indistinguishable in appearance from Rainbow’s? (c) Would monozygotic twin
(= ‘identical’ twins, in genome sequence at least) natural daughters of Rainbow show identical
coat colour patterns? (d) If your answer to (c) suggests that the twin daughters might not look
the same, how then could you be sure that two daughter cats are monozygotic twins, without
sequencing their entire genomes?
Problem 2.5 A transitive relationship is one such that if A is related to B, and B is related to C,
then A is related to C. (In arithmetic, equality is a transitive relation – if A = B and B = C, then
A = C.) Consider whether ‘belong to the same species’ is a transitive relationship. There are
examples of ‘ring species’, sets of geographically connected populations such that members of
neighbouring populations can interbreed, but at least two populations, not necessarily the most
geographically distant, cannot. A classic example is the Larus gulls, which inhabit regions
surrounding the North Pole (Figure 2.18).
For which definitions of species (see p. 66) is ‘belong to the same species’ a transitive relationship?
Problem 2.6 We know that viruses can deliver genes to mammals, including humans. Examples
include curing X-linked adrenoleukodystrophy in humans, and increasing the apparent intelligence
of mice. (a) What kinds of problems would you expect to arise in setting guidelines for what
genetic modifications of humans should be allowed? (b) How far could you go towards drafting
such a set of guidelines?
Problem 2.7 In a study of brown bear (Figure 2.19) populations in northwest North America,
samples of mitochondrial DNA were collected and sequenced from 317 free-ranging brown bears
(Ursus arctos) from 22 localities. Forty-six variable sites corresponded to 29 haplotypes, which
clustered into four major clades (see Table 2.2). Table 2.3 identifies the location(s) at which bears
with the corresponding haplotypes were found.
(a) On a copy of a map showing Alaska, northwestern Canada, and the lower 48 states as far
south as northern Wyoming, mark in different colours the sites of appearance of bears
with mitochondrial DNA in the different classes. Use the data in Table 2.3. Describe the
geographical distribution of the different classes. Do they overlap substantially?
Table 2.2 Mitochondrial sequence data from brown bears in northwest North America. These data do not represent continuous sequences, but only the variable positions. A dot indicates that the base at this position is the same as in the reference sequence in the first line. The letters at the end of each line are locality codes.
           CCCTCCCAACGTTAACATTACGTAATCGAACAGCGGGTTAGGGAAC  O
           ..T.T.T.......................................  Q
Clade I    ..T.TTT.......................................  PU
           ....T.........................................  Q
           ....T.T.......................................  MOPRSV
           ....TTT....................................G..  U
           ....TTT.......................................  T
           ......TGG......T.......G.........T.......AA...  HN
           ......TGG.....GT.........................AA.G.  G
           ......TGG......T............G..........G.AA.G.  LM
Clade II   ......TGG......T............G............AA.G.  IJK
           ......TGG......TG................T.......AA.G.  K
           ......TGG.....GT...G.................C...AA.G.  H
           ......TGG......T.................T...C...AA.G.  H
           ......TGG......T.................T.......AA.G.  H
           ......TGG.....GT.......G.............C...AA.G.  H
           TT..T.T..T..CG...................T......A.A..T  BE
           TT..T.T..T..CG........C..........T........A..T  ABCDE
           TT..T.T.....CG........C.........ATA.......A..T  B
Clade III  TT..T.T.....CG........C..........T...........T  BC
           TT..T.T..T..CG........C..........T...........T  A
           TT..T.T..T.CCG........C.........ATA.......A..T  B
           TT..T.T..T..CG........C..........T..A.....A..T  A
           TT..T.T..T..CG...............G...T..A.....A..T  C
           .T.C......A......C..TA.GGCT...T....A..C...A..T  F
Clade IV   .T.C....G.A......C.GTA.GGCT...T....A..C...A..T  F
           .T.C......A......C.GTA.GGCT...T....A..C...AG.T  F
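The dot notation of Table 2.2 is easy to handle programmatically. The Python sketch below expands a dotted haplotype against the reference sequence and counts the positions at which two haplotypes differ. The function names expand and differences are illustrative assumptions; the sequences are the reference line, the first Clade I line (written compactly), and the first Clade IV line of the table.

```python
REFERENCE = "CCCTCCCAACGTTAACATTACGTAATCGAACAGCGGGTTAGGGAAC"


def expand(dotted, reference=REFERENCE):
    """Replace each dot with the reference base at the same position."""
    if len(dotted) != len(reference):
        raise ValueError("haplotype and reference must have equal length")
    return "".join(r if d == "." else d for d, r in zip(dotted, reference))


def differences(a, b):
    """Count the positions at which two expanded haplotypes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# First Clade I line of Table 2.2 (..T.T.T followed by dots only).
clade_i = expand("..T.T.T" + "." * 39)
# First Clade IV line of Table 2.2.
clade_iv = expand(".T.C......A......C..TA.GGCT...T....A..C...A..T")

print(differences(REFERENCE, clade_i))
print(differences(REFERENCE, clade_iv))
print(differences(clade_i, clade_iv))
```

Averaging such pairwise counts over all pairs drawn from two clades gives the between-class differences asked for later in this problem.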
(b) You are lost somewhere in northwest North America. You collect a tissue sample from a brown
bear in the vicinity and determine the following mitochondrial DNA haplotype (relative to the
reference sequence in Table 2.2):
.T.C......A......C.GTA.GGCT...T....A..C...A..T
Where are you?
(c) Additional sequences were collected from four polar bears (Ursus maritimus) from zoos:
Reference sequence CCCTCCCAACGTTAACATTACGTAATCGAACAGCGGGTTAGGGAAC
Polar bear 1 .T........A......CCGTA.GG.....T....A..C...A..T
Polar bear 2 .T........A......CCGTA.GG..A..T....A..C...A..T
Polar bear 3 .T........A......CCGTA.GG..A...GA..A..C...A..T
Polar bear 4 .T........A......CCGTA.GG..A....A..A......A..T
A 44.4 110.3
B 48.0 113.0
C 48.5 116.5
D 51.0 114.2
E 51.0 118.1
F 57.2 134.6
G 58.7 133.5
H 60.8 139.5
I 67.8 115.3
J 69.2 124.0
K 69.4 129.0
L 69.0 138.0
M 69.2 143.8
N 60.1 142.5
O 63.0 145.5
P 60.0 154.8
Q 63.1 151.0
R 68.5 158.0
S 65.5 165.0
T 55.2 162.7
U 58.3 155.0
V 60.9 161.2
(The reference sequence, from a brown bear, is the same as the reference sequence in
Table 2.2.) To which class of brown bears are these polar bears most closely related? Where
are the brown bears most closely related to polar bears found?
(d) For the brown bear sequences, compute the average number of sequence differences between
classes. Assuming a divergence rate of ∼0.125 sequence changes per 10 000 years, estimate
the times since divergence of the classes. Note that the total length of the region sequenced
was 294 bases.
Brown bears immigrated to North America from Asia over a temporary land bridge
over the Bering Strait. The first fossil evidence for brown bears in the New World is from
50 000–70 000 years ago. Is it likely that the classes diverged in North America or that they
had already diverged in Asia?
(e) Assume that (1) the original North American brown bear population contained all of the
currently observed haplotypes; and (2) there is now a continuous brown bear habitat covering
all areas listed in Table 2.3. What then accounts for the current geographical distribution of the
different haplotype classes? There are two questions to address:
• what accounted for their initial separation?
• what is continuing to keep them separated?
Weblems
Weblem 2.1 Using the Online Mendelian Inheritance in Man™ site (http://www.ncbi.nlm.nih.gov/omim),
what are the clinical consequences of the mutations associated with (a) haemoglobin
Adana? (b) haemoglobin Malhacen?
Weblem 2.2 What animal models are available for the human mental diseases schizophrenia,
depression, and bipolar disorder?
Weblem 2.3 Mammalian females achieve dosage compensation by silencing one of the X
chromosomes in each cell. In birds, however, males are homogametic (ZZ) and females are
heterogametic (ZW), with the W chromosome being gene deficient like the mammalian Y
chromosome. Do male birds achieve dosage compensation by silencing one of their Z
chromosomes?
Weblem 2.4 Determine a feature of a human MHC haplotype that is associated with a relatively
slow progression to AIDS after HIV infection.
Weblem 2.5 Which MHC haplotypes give indications for effectiveness of use of the following
drugs: (a) carbamazepine (epilepsy, bipolar disorder) (b) ximelagatran (an anticoagulant)?
Weblem 2.6 Suggest menus, as close as possible to normal diets, for breakfast, lunch, and dinner,
appropriate for a university student with PKU. Include recommended amounts.
Weblem 2.7 The three most frequent mutations in BRCA1 and BRCA2 genes in Ashkenazi Jews
are 185delAG and 5382insC in BRCA1 and 6174delT in BRCA2. In what exons of these genes do
these mutations appear?
Weblem 2.8 What are the most common mutations in BRCA1 among Swedish women?
CHAPTER 3
Mapping, Sequencing, Annotation, and Databases
LEARNING GOALS
• To know some of the important landmarks in the historical background, including the classical
work of Darwin and Mendel, through Morgan and Sturtevant, to the more recent research
leading to the discovery of the double-helical structure of DNA and the development of the
human genome project.
• To distinguish different types of map – genetic linkage maps, chromosome banding patterns,
restriction maps, and DNA sequences – and the relationships among them.
• To understand the relationships among linkage, linkage disequilibrium, haplotypes and the
collection of single-nucleotide polymorphism (SNP) data.
• To understand how the basic principles of DNA sequencing developed from the initial
breakthroughs to the current automated high-throughput systems.
• To understand the primer extension reaction catalysed by DNA polymerase and its termination
by dideoxynucleoside triphosphates.
• To understand the basis and importance of the polymerase chain reaction (PCR) as a method for
amplification of selected DNA sequences within a mixture.
• To grasp the significance and relationships of reads, overlaps, contigs, and assemblies as parts of
an overall strategy and organization of a sequencing project.
• To be familiar with sequencing gels using autoradiography and to appreciate the advantages of
using fluorescent chain-terminating dideoxynucleoside triphosphates.
• To understand new developments in high-throughput sequencing, and the goals set for the next
few years of development.
• To be able to contrast hierarchical strategies of whole-genome sequencing based on maps and
BAC clones, with the whole-genome shotgun approach.
• To understand techniques of genetic testing based on sequence determination.
Similarities between parents and offspring, within monastery, instead of becoming a teacher, because
human families, and in animals and plants, have he had failed the botany exam. Twice! The price of
always been obvious. An understanding of how insisting on original ideas.
heredity works is more recent.
A lack of understanding of the mechanism of
The story begins in a 10-year period starting in
heredity was a barrier to the development of Darwin’s
the late 1850s. Threads then cast on would require
insights. Mendel’s work supplied the crucial missing
a further century to ramify and intertwine.
ideas: the discreteness and persistence of the elements
• On 1 July 1858, Charles Darwin and Alfred Russel of hereditary transmission, later called genes. Never-
Wallace presented to the Linnaean Society of theless, although copies of the Proceedings of the
London their ideas on the development of species Natural History Society of Brünn were distributed
through natural selection of inheritable traits.* around the scientific community, Mendel’s work
Darwin’s book, The Origin of Species, appeared went largely unnoticed until it was rediscovered early
on 22 November 1859. in the 20th century.
• In 1860, Louis Pasteur published the observation There are several ironic aspects to this situation.
that the mould Penicillium glaucum preferentially 1. Among the recipients in Britain of the Brünn Soci-
metabolized the l-form of tartaric acid, one of two ety Proceedings were the Royal Society of London
mirror-image molecules that have identical struc- and the Linnaean Society. Darwin was a member
tures except that one is left-handed and the other of both and had access to their libraries. Even
right-handed. This work, based on Pasteur’s earlier more, Darwin personally owned a book by W.O.
separation of racemic tartaric acid by manual Focke, Plant Hybridisation, published in 1880,
selection of crystals of different shape, brought the which included a section describing Mendel’s
idea of three-dimensional molecular structure into work and its implications. When J.G. Romanes,
biology: to understand how a biological process preparing an article on hybrids for the Encyclo-
works, we must know the detailed spatial structure paedia Britannica, appealed to Darwin for help
of the relevant molecules. Thus began a long court- in making his review complete, Darwin sent his
ship between biology and molecular structure that copy of Focke’s book. But, in one of the nearest
has flowered into an intimate and fecund marriage. near-misses in scientific history, neither Darwin
• On 8 February and 8 March 1865, the monk nor Romanes read the section on Mendel’s work.
Gregor Mendel read his paper, Experiments on (How do we know this? The relevant pages were
Plant Hybridization, to the Natural History never cut open! The book is now in the Cam-
Society of Brünn in Moravia. (US President bridge University Library, with the pages still intact.)
Abraham Lincoln was assassinated on 15 April.) 2. Darwin came close to an independent statement
His paper was published the following year in the of Mendel’s conclusion, that traits that differ
Society’s Proceedings. Mendel had entered the between parents persist in the offspring, rather
than blend. In a letter to Thomas Huxley in 1857,
* Several earlier writers had suggested the idea of nat- Darwin wrote:
ural selection, including William Charles Wells and Patrick ‘I have lately been inclined to speculate, very crudely
Mayhew. It appears, for example, in the appendix to May-
and indistinctly, that propagation by true fertilisation
hew’s 1831 book, On Naval Timber and Arboriculture,
will turn out to be a sort of mixture, and not true fusion,
explicitly alluding to the possibility of creating novel spe-
cies. (Mayhew’s interest was in optimizing the growth of of two distinct individuals, or rather of innumerable
trees for building warships for the Royal Navy.) Mayhew individuals, as each parent has its parents and ancestors.
complained when The Origin of Species first appeared, and I can understand no other view of the way in which
Darwin gave credit to Wells and Mayhew in subsequent crossed forms go back to so large an extent to ancestral
editions. forms. But all this, of course, is infinitely crude.’
Indeed, Darwin himself hybridized pea plants and observed segregation of traits! In 1866 he wrote to Wallace:
‘I crossed the Painted Lady and Purple sweetpeas, which are very differently coloured varieties, and got, even out of the same pod, both varieties perfect but none intermediate.’
In the same letter, Darwin pointed out, as an obvious example of discrete rather than blending inheritance, that male and female parents give rise to male and female offspring.
The legacy of the 1860s – Darwin, Pasteur, Mendel – was completed by the discovery of DNA by Friedrich Miescher in 1869. The structure and function of DNA were equally unknown. However, microscopic observations of the role of chromatin in fertilization led quite early on to suggestions that Miescher’s substance was ‘responsible . . . for the transmission of hereditary characteristics’.* The cell biologists got there first, and got it right.
The idea of DNA as the hereditary material then vanished for many years. It encountered considerable resistance when it was subsequently proposed again.
* Hertwig, W.A.O. (1885). Das Problem der Befruchtung und der Isotropie des Eies, eine Theorie der Vererbung. Jenaische Zeitsch. f. Medizin u. Naturwiss. 18, 276–318. For an article about Miescher’s scientific career, see Dahm, R. (2005). Friedrich Miescher and the discovery of DNA. Dev. Biol. 278, 274–288.
What is a gene?
In 1928, Frederick Griffith studied virulent and non-virulent strains of Streptococcus pneumoniae, showing that the virulent strain, even if killed, contained a substance that could transform a non-virulent strain into a virulent one and that the induced virulence is heritable. In 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty tested different chemical components of the cell for transforming activity. They identified the DNA from the virulent strain as the molecule that induced the transformation. (We now interpret bacterial transformation as ‘horizontal gene transfer’.) As controls, they showed that transformation was inhibited by enzymes that destroy DNA but not by enzymes that destroy proteins. However, their contemporaries were not receptive to their conclusions. General acceptance of the idea that DNA is the hereditary material awaited the 1952 experiments of Alfred Hershey and Martha Chase, who showed that when bacteriophage T2 replicates itself in Escherichia coli, it is the viral DNA and not the viral protein that enters the host cell and carries the inherited characteristics of the virus.
• DNA was discovered in 1869. Avery, MacLeod, and McCarty showed in 1944 that DNA was the active substance in bacterial transformation. Hershey and Chase showed that during bacteriophage infection, DNA but not protein entered the host cell.
Maps and tour guides
Maps tell us where things are. More specifically, they tell us where things are in relation to other things. In genomics, maps have been essential in revealing the organization of the hereditary material.
Different types of map describe different types of observation:
1. Linkage maps of genes
2. Banding patterns of chromosomes
3. Restriction maps – DNA cleavage fragment patterns
4. DNA sequences.
Genes, as discovered by Mendel, were entirely abstract entities. Chromosomes are physical objects, with banding patterns as their visible landmarks. Only with DNA sequences are we dealing directly with stored hereditary information in its physical form. Restriction maps are in effect partial DNA sequences – they give the positions of particular oligonucleotides within DNA molecules.
It was the very great achievement of the last century of biology to forge connections between these maps. A crucial idea that emerged from mapping is that the organization of hereditary information is linear.
Linked traits are governed by genes on the same chromosome. However, in many cases linkage is incomplete. During gamete formation, alleles on different chromosomes of a homologous pair can recombine. This occurs as a result of crossing over, the exchange of material between homologous chromosomes during copying in meiosis (see Figure 3.1).
Thomas Hunt Morgan, at Columbia University in New York City, USA, observed varying degrees of linkage in different pairs of genes. He suggested that the extent of recombination could be a measure of the distance between the genes on the chromosome. Morgan’s student Alfred Sturtevant, then an undergraduate, made a crucial observation: the data were consistent with a linear distribution of genes. What he found was that genetic distance, as measured by crossing-over frequency, was additive. Consider three genes A, B, and C. Suppose that the distance from A to B is 5 and the distance from B to C is 3. Then, if the distance from A to C is 8 = 3 + 5, the observations are consistent with a linear and additive structure with gene order A–B–C. (Alternatively, the distances would also be additive if the distance from A to C were 2, implying gene order A–C–B.) Note that additivity of distances does not hold for points at the vertices of a triangle rather than on a line.
Sturtevant’s analysis made it possible to determine the order of genes along each chromosome and to plot them along a line at positions consistent with the distances between them. The unit of length in a gene map is the Morgan, defined by the relation that 1 centimorgan (cM) corresponds to a 1% recombination frequency. We now know that 1 cM is ∼10^6 bp in humans, but it varies with the location in the genome, the distance between genes, and the gender of the parent: for males 1 cM is ∼1.05 Mb; for females, 1 cM is ∼0.88 Mb. Crossing over is reduced in pericentromeric regions. Other regions are ‘hot spots’ for crossing over. It is estimated that ∼80% of genetic recombination takes place in no more than ∼25% of our genome.
Linkage guides the search for genes. To identify the gene responsible for a disease, look for a marker of known location that tends to be co-inherited with the disease phenotype. The target gene is then likely to be on the same chromosome, at a position near to the marker.
[Figure 3.1 schematic: parental genotype (diploid) with chromosomes A B and a b; after meiosis, gametes A B + a b (no recombination) or A b + a B (recombination).]
Figure 3.1 Consider two loci on the same chromosome. One locus has alleles A and a, the other has alleles B and b. One individual has alleles A and B on one chromosome (pink) and alleles a and b (blue) on the other (top). Gametes from this individual may form without recombination to give a haploid gamete containing alleles A and B and another haploid gamete containing alleles a and b (lower left). Each gamete contains a chromosome identical (at least as far as these loci are concerned) to one of the parental chromosomes. Alternatively, crossover between the two loci may produce recombinant gametes: one haploid gamete containing alleles A and b and another haploid gamete containing alleles a and B (lower right). Neither gamete contains a chromosome identical to a parental one. The fraction of gametes showing recombination depends on several factors, notably the distance between the loci. (The same fractions of recombinants and non-recombinants would be produced even if the parent were homozygous, although they would be likely to be indistinguishable.)
In the absence of selection, the fraction of viable recombinant and non-recombinant gametes produced depends only on the genotype of the individual that produced them. It has nothing to do with the allele distribution in the population. If a remote descendant had the parental genotype shown here, its gametes would show the same fractions of recombinants and non-recombinants.
Linkage disequilibrium
Figure 3.1 showed the gametes arising from one parent, heterozygous for two traits. The recombination rate depends only on the structure of the chromosomes of this individual, not on whether this genotype is rare or common in a population.
Now suppose that we have a large interbreeding population and that every individual in the population has the same parental genotype shown in Figure 3.1. How will the genotype distribution in the population develop? (Assume that no combination of alleles for these two traits has any selective advantage or produces any preferential mating pattern.) Recombination at meiosis, followed by zygote formation, can in
principle produce individuals of three genotypes: AB/ab, Ab/aB = aB/Ab and ab/ab (where, for instance, AB/ab signifies an individual with alleles AB on one chromosome and ab on the other; this is the parental genotype in Figure 3.1). Starting from the original completely AB/ab population, eventually recombination will randomize the allelic correlation, producing a population in which the ratio of genotypes AB/ab : Ab/aB = aB/Ab : ab/ab is 1:2:1. (This assumes that the overall gene frequency of the population is A = a and B = b; see Problem 3.1.)
How fast the randomization occurs depends on the recombination rate, which depends on the genetic distance between the loci. In the short term, if recombination is infrequent, the parental genotype AB/ab will continue to predominate. The deviation of the genotype distribution in the population from the ultimate 1:2:1 ratio is called linkage disequilibrium.
• Two markers are in linkage disequilibrium if the observed distribution of different combinations of alleles differs from that expected on the basis of independent hereditary transmission of the individual alleles.
In the absence of linkage disequilibrium, the frequencies of allelic combinations will be proportional to the products of the frequencies of the individual alleles. Linkage disequilibrium measures the deviation from this equilibrium distribution.
If the overall fractions of the alleles at two loci are p_A, p_a = 1 − p_A, p_B, p_b = 1 − p_B then:
equilibrium value of p_AB = p_A × p_B
equilibrium value of p_Ab = p_A × p_b
equilibrium value of p_aB = p_a × p_B
equilibrium value of p_ab = p_a × p_b.
Note that:
• there is no necessary relationship between p_A and p_B;
• at equilibrium: p_AB × p_ab = p_Ab × p_aB = p_A × p_B × p_a × p_b;
• a measure of linkage disequilibrium is: D = p_AB × p_ab − p_Ab × p_aB, where D = 0 implies that the system is at equilibrium, D > 0 implies that chromosomes with AB and ab are more common than expected, D < 0 implies that chromosomes with Ab and aB are more common than expected.
Suppose there is no selection or gene import into the population, i.e. the overall frequencies of individual alleles in the population remain constant. Then linkage disequilibrium will decay as a result of recombination. With each successive generation, the value of D will become closer to 0.
Linkage and linkage disequilibrium are closely related but distinct concepts
Linkage is about the distribution of loci among chromosomes. Linkage disequilibrium is about the distribution of allelic patterns in populations. Close linkage of two loci on a chromosome is a common source of long-term persistence of linkage disequilibrium. Two genes at opposite ends of the same chromosome, although formally linked, may not show significant linkage disequilibrium, because crossing over is frequent. Conversely, it is possible – although rare – to observe linkage disequilibrium between two genes on different chromosomes. (This can happen in two ways: (1) a community of immigrants imports a particular set of single-nucleotide polymorphisms (SNPs) into a larger population and they preferentially inter-marry for many generations, or (2) theoretically by interactions between gene products that permit only certain combinations of alleles to be viable.)
Classical linkage maps typically involved markers no less than 1 cM apart (∼1 Mb in humans) (this is the situation shown in Figure 3.2). Linkage disequilibrium is detectable between markers ∼0.01–0.02 cM apart (∼10–20 kb). Therefore, linkage disequilibrium is a much finer tool for localizing a target gene.
Chromosome banding pattern maps
Banding patterns are visible features on chromosomes (see Box 3.2). The most commonly used pattern is G-banding, produced by Giemsa stain. The bands reflect base composition and chromosome loop structure. The darker regions tend to contain highly condensed heterochromatin of relatively low GC/AT ratio and sparse in gene content.
The karyotype of an individual comprises the structures of the individual chromosomes. The karyotype
is largely constant for all individuals within a species, but varies between species. This is the result of chromosome rearrangement during evolution. The inability of cells with incongruent karyotypes to pair properly is one barrier to fertility that contributes to species divergence.
Although most individuals of a species have the same karyotype, occasionally aberrant chromosomes appear. Some of them are lethal and others are correlated with disease. (For example, Prader–Willi or Angelman syndromes; see Box 3.2.) Studies of chromosome banding patterns support several types of investigations.
BOX 3.2 Nomenclature of chromosome bands
In many organisms, chromosomes are numbered in order of size, 1 being the largest. The two arms of human chromosomes, separated by the centromere, are called the p (petite = short) arm and q (queue) arm. Regions within the chromosome are numbered p1, p2 . . . and q1, q2 . . . outward from the centromere. Additional digits indicate band subdivisions. For example, certain bands on the q arm of human chromosome 15 are labelled 15q11.1, 15q11.2, and 15q12. Originally, bands 15q11 and 15q12 were defined; subsequently, 15q11 was divided into 15q11.1 and 15q11.2.
[Figure in box: chromosome ideogram with band labels running from 13 to 11.1 on the p arm and from 11.1 to 26.3 on the q arm.]
Deletions in the region 15q11–13 are associated with Prader–Willi and Angelman syndromes. These syndromes have the interesting feature that alternative clinical consequences depend on whether the affected chromosome is paternal (leading to Prader–Willi syndrome) or ➔
[Figure 3.2 schematic: loci in the order A r s t x M y u v w B along a chromosome, with scale bars of 0.2 cM and 1 cM.]
Figure 3.2 Suppose a mutation to a disease gene, M, occurred in a human population 50 generations ago. Consider a portion of the genome that includes the site of the disease mutation, M, plus genes for two known phenotypic traits, A and B, and two closely spaced markers, x and y. The genes for traits A and B are 1 cM away from the mutated locus. The markers x and y are 0.1 cM from M.
It is highly probable that markers x and y will be co-inherited with M in any pedigrees for which records exist, as the probability of recombination between x and M or between y and M in any generation is small: 0.1% = 0.001. The probability of recombination in 50 generations is approximately 0.001 × 50 = 0.05. On the other hand, the probability that markers A and M or B and M have been separated by recombination is very high. The probability of recombination between A and M in any generation is 1%. The probability of recombination in 50 generations is approximately 0.4 (see Problem 3.2).
In the history of transmission of this disease gene over 50 generations, markers x and y are likely to be co-inherited with the disease, but genes for traits A and B will not be reliably coupled with M. Therefore, genes A and B, separated by 1 cM (∼1 Mb in the human), will not be a reliable guide to localizing the target gene M by looking in family pedigrees for genes co-inherited with M. However, a distribution of markers such as x and y, separated by 0.2 cM (∼200 kb) or less, is likely to provide a reliable guide to localizing the target gene M through correlation of disease occurrence and genetic markers in family pedigrees.
For humans, we do not have access to 50 generations of records and DNA samples (which would amount to about 1000 years). However, the effects of recombination during the 50 generations since the mutation filter out all but the most closely linked genes from the co-inheritance pedigree. Later we shall see that haplotype groupings simplify the identification of gene–marker correspondences.
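The arithmetic in the Figure 3.2 caption and the disequilibrium measure D defined earlier can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and it assumes that recombination events in successive generations are independent, so that the chance of at least one event in n generations is 1 − (1 − r)^n. For small r this is close to the caption's approximation r × n.

```python
def prob_separated(r: float, generations: int) -> float:
    """Probability that two loci have been separated by recombination at
    least once, assuming an independent per-generation probability r."""
    return 1.0 - (1.0 - r) ** generations


def linkage_disequilibrium(p_AB: float, p_Ab: float, p_aB: float, p_ab: float) -> float:
    """D = p_AB * p_ab - p_Ab * p_aB, the measure given in the text."""
    return p_AB * p_ab - p_Ab * p_aB


if __name__ == "__main__":
    # Markers x and y lie 0.1 cM from M: r = 0.001 per generation.
    print("x-M, 50 generations:", round(prob_separated(0.001, 50), 3))
    # Genes A and B lie 1 cM from M: r = 0.01 per generation.
    print("A-M, 50 generations:", round(prob_separated(0.01, 50), 3))

    # Equilibrium: combination frequencies are products of allele frequencies.
    p_A, p_B = 0.6, 0.3          # arbitrary illustrative allele frequencies
    p_a, p_b = 1 - p_A, 1 - p_B
    print("D at equilibrium:",
          round(linkage_disequilibrium(p_A * p_B, p_A * p_b, p_a * p_B, p_a * p_b), 12))

    # Starting population: only AB and ab chromosomes, in equal numbers.
    print("D for AB/ab only:", linkage_disequilibrium(0.5, 0.0, 0.0, 0.5))
```

The two probabilities come out near 0.05 and 0.4, in line with the caption. D is zero when the combination frequencies are products of the allele frequencies, and positive when AB and ab chromosomes are over-represented.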
• Complete sequencing of yeast chromosome III in 1992 gave the first opportunity for direct comparison of a genetic linkage map and positions in the DNA sequence.
The modern technique for mapping genes onto chromosomes is fluorescent in situ hybridization (FISH). A probe oligonucleotide sequence labelled with fluorescent dye is hybridized to a chromosome. The location where the probe is bound shows up directly in a photograph of the chromosome. Typical
If we compare our chromosomes with those of a chimpanzee, we see that some large-scale rearrangements have taken place. Human chromosome 2 is split into two separate chromosomes in the chimpanzee (see Figure 3.6). However, most regions in the corresponding chromosomes of the two species show conservation of banding patterns. Such regions are called syntenic blocks. For human and chimpanzee, full genome sequences are available. They confirm that the relationships suggested by the comparisons of banding patterns reflect conservation at the level of DNA sequences.
individuals will supersede the cytogenetic detection of large-scale mutations.
High-resolution maps, based directly on DNA sequences
Formerly, we could see genomes only by the reflected light of phenotypes. Now, markers are no longer limited to genes with phenotypically observable effects, which are anyway too sparse for an adequately high-resolution map of the human genome. Now that we can interrogate DNA sequences directly, any features of DNA that vary among individuals can serve directly as markers.
The first genetic markers based directly on DNA sequences, rather than on phenotypic traits, were restriction fragment length polymorphisms (RFLPs). The genetic marker is the size of the restriction fragments that contain a particular sequence within them (see Figure 3.7).
[Figure residue: gel schematic labelled ‘Direction of electrophoresis’. EcoRI recognition site: GAATTC / CTTAAG]
Other useful types of marker include:
• Variable number tandem repeats (VNTRs), also called minisatellites. VNTRs contain regions 10–100 bp long, repeated a variable number of times – same sequence, different number of repeats. In any individual, VNTRs based on the same repeat motif may appear only once in the genome or several times, with different lengths on different chromosomes. The distribution of the sizes of the repeats is the marker. Inheritance of VNTRs can be followed in a family and correlated with a disease phenotype like any other trait. VNTRs were the first genetic sequence data used for personal identification – genetic fingerprints – in paternity and in criminal cases (see Chapter 8).
• Short tandem repeat polymorphisms (STRPs), also called microsatellites. STRPs are regions of only
As an analogue of a restriction map: walk the entire length of Broadway in New York (see http://www.marktaw.com/local/MarksWalkingTour.html). Mark on the map the location of every Starbucks coffee shop and calculate the number of blocks between successive sites; then mark the location of every Citibank office and calculate the number of blocks between successive sites; then calculate the number of blocks between every occurrence of either Starbucks or Citibank. A mutation in one of these ‘sites’ will change the sizes of the fragments, allowing the mutation to be located in the map.
(Thanks to Professor B. Misra, New York University)
Restriction enzymes can produce fairly large pieces of DNA. Cutting the DNA into smaller pieces, which are cloned and ordered by sequence overlaps, produces a finer dissection of the DNA called a contig map.
In the past, the connections between chromosomes, genes, and DNA sequences have been essential for identifying the molecular deficits underlying inherited diseases, such as Huntington’s disease or cystic fibrosis. Sequencing of the human genome has changed the situation radically.
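The restriction-map analogy above, and the idea behind RFLPs, can be made concrete with a small Python sketch. The toy sequence and the function names are hypothetical. The recognition site GAATTC is the EcoRI site quoted earlier; the cut position (after the first G of the site) is an added detail not stated in the text, passed in as the parameter cut_offset.

```python
def cut_sites(sequence: str, site: str) -> list[int]:
    """0-based start positions of every occurrence of the recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Lengths of the pieces of a linear molecule after cutting at every site.
    cut_offset is how many bases into the site the strand is cut."""
    cuts = [p + cut_offset for p in cut_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    ECORI = "GAATTC"
    # Hypothetical 20-base molecule containing two EcoRI sites.
    wild_type = "AA" + ECORI + "CCCC" + ECORI + "TT"
    # A single base change destroys the second site.
    mutant = "AA" + ECORI + "CCCC" + "GAGTTC" + "TT"
    print(fragment_lengths(wild_type, ECORI, 1))  # three fragments
    print(fragment_lengths(mutant, ECORI, 1))     # two fragments
```

Losing one site merges two neighbouring fragments into a longer one. That change in fragment sizes is what is observed on a gel and used as the marker.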
Discovery of the structure of DNA
Life involves the controlled manipulation of matter, energy, and information. Biochemists were familiar with the structures and mechanisms of enzymes, living molecules catalyzing conversions of matter and energy. What kind of molecule could store and manipulate information?
Chemical analysis of DNA during the early part of the 20th century characterized the constituents – the bases and the sugars – and the nature of their linkage. DNA is a polynucleotide chain, containing a repetitive backbone of sugar–phosphate units, with small organic bases – adenine, thymine, guanine, and cytosine – attached to each sugar (see Figure 3.9).
Not only was the distribution of the bases along the chain unknown, its significance was entirely unsuspected (except for a prescient comment by physicist Erwin Schrödinger – based on ideas of Max Delbrück – in his influential 1944 book, What is Life?: ‘We believe a gene – or perhaps the whole chromosome fibre – to be an aperiodic solid.’ Do not be misled by the reference to a solid. Schrödinger himself included a footnote: ‘That it [the chromosome] is highly flexible is no objection, so is a thin copper wire.’) Indeed, many people accepted P.A. Levene’s idea that DNA contained a regular repetition of a constant four-nucleotide unit. It was, therefore, considered that DNA simply lacked the versatility required to convey hereditary information, compared, for example, with proteins. We now know that eukaryotic DNA does contain repetitive sequences, but is not limited to them.
• This attitude contributed to the general lack of acceptance of the experiments of Avery, MacLeod, and McCarty as proof that DNA was the genetic material.
An understanding of how DNA worked in biological processes required a detailed three-dimensional structure. In the 1950s, the method of choice for determination of molecular structure was X-ray crystallography. The pioneer of X-ray structure determination, Sir Lawrence Bragg, Cavendish Professor of Physics at Cambridge, was an enthusiastic supporter of the efforts of Max Perutz and his group to extend methods of X-ray crystallography to biological molecules as large as proteins.
Figure 3.10 X-ray diffraction pattern of DNA fibre, by R. Franklin. The X-shaped pattern at the centre of the picture is diagnostic of a helical structure. From the distribution of intensity, it is possible to deduce the number of residues per turn and the symmetry of the structure.
From: Franklin, R.E., & Gosling, R.G. (1953). The structure of sodium thymonucleate fibres. II. The cylindrically symmetrical Patterson function, Acta Cryst. 6, 678–685.
Figure 3.11 The complementary base pairings: adenine–thymine and guanine–cytosine.
Figure 3.12 The structure of DNA. The two helical strands wind around the outside of the structure. The bases are inside, stacked like the treads of a staircase with their planes perpendicular to the axis of the double helix. The bases are visible through the major and minor grooves and are thereby accessible for interaction with proteins. This picture shows a stereo pair, most easily viewed with a standard stereo viewer or lorgnette.
The story has been told on many occasions, in print and on film. Notable for its unusually sensational treatment of scientific history is Watson’s 1968 autobiography, The Double Helix. A collection of the reviews it elicited still makes interesting reading.† There is now consensus that Franklin’s contributions to the discovery of the structure of DNA were underestimated, not least by Watson, who in The Double Helix disparages her mercilessly, both professionally and personally. Of numerous recent attempts to redress the balance, Sir Aaron Klug’s lecture is the most authoritative.‡
† Stent, G. (ed.) (1980). The Double Helix: Text, Commentary, Reviews, Original Papers. W.W. Norton, New York & London.
‡ Klug, A. (2003). The discovery of the DNA double helix. In Changing Science and Society. T. Krude (ed.) Cambridge University Press, Cambridge, pp. 5–43.
DNA sequencing
In 1953, after the Hershey–Chase experiment and the announcement of the double helix, molecular biologists knew that DNA contained the hereditary information. They saw how cellular machinery could achieve access to it, using base-pair complementarity. But if the sequence of the bases was like a text everyone wanted to read, not only was it a text in an unknown language, but there were not even any examples of the language, because the sequences were unknown. The importance of the problem of
And pity ’tis ’tis true
It would be more difficult to reassemble this from fragments, because the repetitions create ambiguities. Indeed, the very large amounts of repetitive sequence in eukaryotic genomes do create problems for assembly algorithms.
for sequencing relatively long fragments, making unambiguous assembly possible.
Sanger’s method used DNA polymerase to synthesize a new strand of DNA, embodying, as part of the synthesis, reactions that reveal the sequence.
DNA polymerase is a replication enzyme that synthesizes the strand complementary to a piece of single-stranded DNA. It requires a primer: a short stretch of complementary strand to be extended by successive addition of nucleotides (see Figure 3.13). The polymerase requires a supply of nucleoside triphosphates. In successive steps, the enzyme adds to the growing primer strand the nucleotide complementary to the next unpaired base in the template. This reaction also forms the basis of the polymerase chain reaction (PCR) (see Figure 3.14).
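The primer extension step described above can be mimicked in a few lines of Python. This is a sketch of the base-pairing logic only, not of the enzyme chemistry. The function name and the short example sequences are hypothetical, and the template is assumed to be written 3′→5′ from left to right so that the primer, written 5′→3′, lines up with its left end.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}


def extend_primer(template_3to5: str, primer_5to3: str) -> str:
    """Extend a primer along a template, one complementary base at a time.

    template_3to5: template strand written 3'->5' (left to right)
    primer_5to3:   primer written 5'->3', paired with the template's left end
    Returns the fully extended strand, written 5'->3'.
    """
    # The primer must already be base-paired with the start of the template.
    for position, base in enumerate(primer_5to3):
        if COMPLEMENT[template_3to5[position]] != base:
            raise ValueError(f"primer does not pair with template at position {position}")

    strand = primer_5to3
    # Each step adds the nucleotide complementary to the next unpaired base.
    for template_base in template_3to5[len(primer_5to3):]:
        strand += COMPLEMENT[template_base]
    return strand


if __name__ == "__main__":
    # Hypothetical 8-base template and 4-base primer.
    print(extend_primer("atacagag", "tatg"))
```

For the template 3′-atacagag and the primer 5′-tatg, the extended strand is 5′-tatgtctc: four bases from the primer and four added by extension.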
DNA sequencing 95
[Figure labels (caption not recovered): Blue strand = primer; Red strand = template. Steps: Original sample; Separate strands, add primers; Extend primers.]

[Figure 3.15 structures: deoxycytidine triphosphate (left) and dideoxycytidine triphosphate (right).]
Figure 3.15 Left: normal deoxycytidine triphosphate, containing both a reactive 5′-hydroxyl, activated as the triphosphate derivative permitting it to be added to the growing primer strand, and a 3′-hydroxyl (red), to which the next nucleotide will be attached. Right: dideoxycytidine triphosphate, in which the 3′ position is unreactive. Dideoxycytidine can be incorporated into the growing strand, but no subsequent extension is possible. P-P-P is a simplified representation of the triphosphate group.

3′-atacagagaatctagatacagagttgttcgag    Template
5′-tatgtctcttagat →                    Primer extension

5′-tatgtctcttagatctatgtctcaacaagctc    Fragments produced
5′-tatgtctcttagatctatgtctcaacaagc
5′-tatgtctcttagatctatgtctcaac
5′-tatgtctcttagatctatgtctc
5′-tatgtctcttagatctatgtc
5′-tatgtctcttagatc
5′-tatgtctc
5′-tatgtc

The crucial point is that the fragments are nested.

To determine the positions of the cytosine residues in the extended primer, it is sufficient to determine the lengths of the fragments. Polyacrylamide gel electrophoresis can separate oligonucleotides according to their lengths accurately enough for sequence determination. It is possible to separate polynucleotides differing in length by a single base, up to molecules about 1000 bases long.

[Figure 3.16 gel image: lanes G, A, T, C; position markers 4260–4380.]
Figure 3.16 A gel from early in the history of DNA sequencing, showing part of the genome of bacteriophage φX-174.
From: Sanger, F., Nicklen, S., & Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463–5467.

In Sanger’s original procedure, the dideoxycytidine carried a radioactive label and the gel was developed by autoradiography. To determine the positions of the other three nucleotides, it was necessary to run four reactions, in parallel, each containing radioactive dideoxy analogues of one of the four nucleoside triphosphates. Four separate reactions are necessary because each nucleotide gives the same signal – the darkening of the film from the radioactive spot. Running the products of the four reactions in parallel lanes on the same gel allows the sequence to be read from the autoradiograph (see Figure 3.16).

• Almost all DNA sequencing methods, from the earliest to those in most common use today, depend on breaking a long DNA molecule into fragments, sequencing the fragments, and then assembling them, via overlaps, into the complete sequence. The methods of Sanger and of Maxam and Gilbert worked by creating a set of nested fragments, each terminating in a known nucleotide, that could be separated on a gel. Knowing the length of the fragment and the terminal nucleotide allowed the sequence to be ‘read off’ the gel.

• Labelling the four dideoxynucleoside triphosphates with different fluorescent dyes, with distinguishable colours, allows ‘one-pot-one-lane’ sequencing reactions (see pp. 18 and 97).
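The nested-fragment idea can be illustrated with a short Python sketch. It uses the template sequence from the primer-extension diagram; the function names are hypothetical, chosen only for this illustration, and the model ignores everything about the chemistry except where chain termination can occur. As in the diagram, every prefix of the extended strand ending in c is listed.

```python
# A minimal sketch of chain-termination sequencing (illustrative names only).
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def extend_primer(template_3to5: str) -> str:
    """Fully extended strand (written 5'->3') for a template written 3'->5'."""
    return "".join(COMPLEMENT[b] for b in template_3to5)

def terminated_fragments(extended: str, base: str) -> list[str]:
    """Prefixes of the extended strand that end in `base` (one dideoxy reaction)."""
    return [extended[: i + 1] for i, b in enumerate(extended) if b == base]

def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the sequence from fragment lengths in four lanes, shortest first."""
    base_at_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))

template = "atacagagaatctagatacagagttgttcgag"  # 3'->5', from the diagram
extended = extend_primer(template)

# Lengths of the fragments from the dideoxycytidine reaction.
ddc_lengths = [len(f) for f in terminated_fragments(extended, "c")]
print(ddc_lengths)

# Four parallel reactions, one per base, give one lane each.
lanes = {b: [len(f) for f in terminated_fragments(extended, b)] for b in "acgt"}
print(read_gel(lanes) == extended)
```

Because each fragment length appears in exactly one lane, sorting the lengths and noting the lane of each band recovers the sequence, which is what reading an autoradiograph amounts to.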
Figure 3.17 A tracing of a sequencing fluorescent chromatogram (simulated). Different-coloured peaks correspond to different bases.
The Maxam–Gilbert chemical cleavage method

The Maxam–Gilbert method for DNA sequencing, also developed in the mid-1970s, likewise worked by comparing nested fragments. A sample of single-stranded DNA, labelled at its 5′-end, is cleaved by base-specific reagents. As in Sanger’s method, polyacrylamide gel electrophoresis separates the fragments by size. The separate cleavage reactions produce fragments that share the 5′-labelled end, but differ in the bases at which the specific 3′-cleavage occurs. The sequence can be read from an autoradiograph.

The Maxam–Gilbert method had early successes, for which Gilbert shared the Nobel Prize. It was Sanger’s method that spawned subsequent developments, however. Because the Maxam–Gilbert method does not use primed DNA synthesis, its applicability is inherently limited to sequences adjacent to restriction sites or other fixed termini. Another disadvantage of the Maxam–Gilbert method was reagent toxicity, notably of hydrazine, a neurotoxin.

Automated DNA sequencing

Using fluorescent dyes as reporters, rather than radioactivity, was an important technical advance in sequencing. Radioisotopes present health hazards in both use and disposal, and are expensive. (One of Sanger’s ‘postdocs’, who wore his hair long in the fashion of the 1970s, was obliged to have a haircut because of contamination from frequently pushing his hair out of his eyes.) By attaching different fluorescent dyes to the four dideoxynucleoside triphosphates, each fragment produced gives a different signal, depending on which dideoxynucleotide terminated the extension. All four reactions can be done ‘in the same pot’, and electrophoresis separates them in a single lane. A laser focused at a fixed point identifies the fragments as they pass. The result can be displayed as a four-colour chromatogram in which successive peaks correspond to successive bases in the sequence (see Figure 3.17).

Quality is important too: The phred score q measures sequencing accuracy. q = 20 implies a probability of ≤1% error per base (see Box 3.7). A common unit of sequencing costs is US$ per 1000 bases determined to q = 20 accuracy. A sequence of 1000 q = 20 bases would be expected to contain no more than ten errors.

In 1986, L. Hood, L. Smith, and co-workers described an instrument, based on detection of base-specific fluorescent tags, that led to the automation of DNA sequencing. Instruments produced by Applied Biosystems implemented the work of Hood, Smith, and co-workers, with improvements by J.M. Prober and co-workers at DuPont. Capillary systems for fragment separation to replace flat-sheet gels were another essential advance. A population explosion in

BOX 3.7 Phred scores: a measure of quality of sequence determination

The phred score of a sequence determination is a measure of sequence quality. It specifies the probability that the base reported is correct.

If p = the probability that a base is in error, then the corresponding phred score $q = -10 \log_{10}(p)$.

Here is a short table:

Quality score q    Probability of error    Error rate
10                 0.1                     1 base in 10 wrong
20                 0.01                    1 base in 100 wrong
30                 0.001                   1 base in 1000 wrong
40                 0.0001                  1 base in 10 000 wrong
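The relation between phred score and error probability in Box 3.7 is easy to check numerically. The following Python sketch (function names are illustrative, not taken from any sequencing software) converts in both directions, regenerates the rows of the table, and estimates the expected number of errors in 1000 bases determined at q = 20.

```python
import math

def phred_from_error(p: float) -> float:
    """Phred score q = -10 log10(p) for an error probability p."""
    return -10.0 * math.log10(p)

def error_from_phred(q: float) -> float:
    """Error probability p = 10^(-q/10) for a phred score q."""
    return 10.0 ** (-q / 10.0)

# Rows of the short table in Box 3.7.
for q in (10, 20, 30, 40):
    p = error_from_phred(q)
    print(f"q = {q}: p = {p:g}, 1 base in {round(1 / p)} wrong")

# Expected number of errors in 1000 bases, each at exactly q = 20.
expected_errors = 1000 * error_from_phred(20)
print(expected_errors)
```

The expected-error figure is an average over many reads; an individual stretch of 1000 bases at q = 20 may contain somewhat more or fewer errors.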
98 3 Mapping, Sequencing, Annotation, and Databases
[Figure 3.18 plot: Cost of sequencing. Vertical axis: cost per base (US$), logarithmic scale from 1E-07 to 1E+05. Horizontal axis: year, 1960–2020.]
Figure 3.18 Fall in the cost of sequencing over time. Note logarithmic scale on the cost axis. The sea change in 2008 from the
introduction of next-generation sequencing platforms is striking. The US National Institutes of Health set goals for a US$100 000
human genome in 2009, and a US$1000 human genome in 2014. Has the 2009 target been met? (See Weblem 3.4.)
(Data from: http://www.genome.gov/sequencingcosts/)
automatic DNA sequencing machines filled the high-throughput installations that produced the human and other complete genomes. Instruments available in 1998 were able to produce ∼1 Mb of sequence per day.

Current goals are to improve the technology by additional orders of magnitude. The US National Institutes of Health (NIH) has set goals for a US$100 000 human genome by 2009 and a US$1000 genome by 2014 (see Figure 3.18).
Two general approaches to genome sequencing projects are:

1. The hierarchical method, in which the whole genome is first fragmented and cloned into bacterial artificial chromosomes (BACs) (see Box 3.8), and the order of the fragments is established before sequencing them.
2. The whole-genome shotgun method, which works directly with large numbers of smaller fragments, with a concomitantly more challenging assembly problem.

Bring on the clones: hierarchical – or ‘BAC-to-BAC’ – genome sequencing

One approach to organizing the sequencing of a large DNA molecule involves dividing the sample into pieces of known relative position.

• First, cut the DNA into fragments of about 150 kb. Clone them into BACs. For example, Arabidopsis thaliana has a haploid genome size of about $10^{8}$ bp. A 3948 clone BAC library for A. thaliana contains ∼100 kb inserts per clone, giving approximately fourfold coverage.

• Identify a series of clones in the library that contains overlapping fragments. Although referred to as ‘fingerprinting’, this process depends on shared (rather than unique) features of overlapping clones, including:
  1. overlap of restriction fragment size patterns
  2. amplification of single-copy DNA between interspersed repeat elements and checking for similar size patterns of fragments
  3. mapping sequence-tagged sites (STSs) and looking for fragments sharing STSs.
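The coverage figure quoted for the A. thaliana BAC library follows from simple arithmetic, sketched below in Python. The second function is a toy illustration of fingerprint comparison by restriction fragment sizes; its name, the size tolerance, and the example fragment sizes are all hypothetical and are not drawn from any real mapping project.

```python
def library_coverage(n_clones: int, insert_bp: float, genome_bp: float) -> float:
    """Average number of times each genome position is represented in the library."""
    return n_clones * insert_bp / genome_bp

# A. thaliana example from the text: 3948 clones, ~100 kb inserts, ~10^8 bp genome.
coverage = library_coverage(3948, 100e3, 1e8)
print(coverage)

def shared_fragment_count(sizes_a: list[int], sizes_b: list[int],
                          tolerance_bp: int = 50) -> int:
    """Count fragments of clone A with an unused, similar-sized partner in clone B."""
    remaining = sorted(sizes_b)
    shared = 0
    for size in sorted(sizes_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance_bp:
                shared += 1
                del remaining[i]  # each fragment of B can be matched only once
                break
    return shared

# Hypothetical restriction fragment sizes (bp) for two clones.
clone_a = [1200, 3400, 5100, 800]
clone_b = [3410, 5080, 950, 2000]
n_shared = shared_fragment_count(clone_a, clone_b)
print(n_shared)
```

Two clones sharing many fragment sizes are candidates for overlap; real fingerprinting pipelines use statistical criteria rather than a fixed tolerance.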
Organizing a large-scale sequencing project 99
[Figure 3.22 diagram labels: Luciferin, Oxyluciferin.]
Figure 3.22 Detection of matching nucleotides by the luciferase reaction to signal the appearance of PPi released when a nucleotide is incorporated.

Each cycle of successive exposure to the four nucleoside triphosphates produces one sequenced base.

* As in many plays, including The Merchant of Venice, Turandot, etc., in which an eligible woman is offered several suitors in turn.

The fragments are sequenced by synthesis. Flooding the system with a polymerase and one fluorescently labelled nucleotide results in incorporation of the nucleotide onto each fragment that has the complementary base adjacent to the growing primer. As in the Roche 454 Life Sciences system, an image of the system shows fluorescent spots at the positions at which the nucleotide was incorporated. After washing and removal of the fluorescent tag, the process repeats with the other three nucleotides, in succession. The four images reveal the distribution of the incorporated nucleotides. Then the process repeats, for the next position.

Note that in this system there is no amplification step.

Illumina Solexa. In the preparation step, fragments bound to a surface are amplified in situ. They form distinct clusters on the surface. At each step, the polymerase adds a base. The four bases have four different fluorescent tags. Therefore the distribution of colours, in an image of the field, identifies which base was added to each cluster. Remove the fluorescent tags, and repeat. The result is a kaleidoscopic movie of shifting colours, one frame per position.

(a) The first two positions include all possible dinucleotides.
(b) The remaining six positions are ‘wild cards’ – molecules that will pair with all four bases.
(c) The 5′ end of the probe bears one of four fluorescent tags. Because there are four tags and 16 dinucleotides, the system is degenerate, each tag representing four of the sixteen dinucleotides.

Figure 3.23 shows how the system works. The first probe bound to one of the fragments starts with TC, implying that the first two bases from the original fragment (after the adaptor) are AG. However, because the fluorescent tag, yellow in this diagram, represents AG, CT, GA, or TC, we only know that the first dinucleotide must be one of these four. We don’t yet know the identity of any one base.

Remove the last three nucleotides, and repeat the process. After the next step, we know that the dinucleotide offset by three positions from the first must be one of four possibilities corresponding to the red tag. In summary, what we know about the target sequence at this point is that it must have the following form:

position    1 2    3 4 5    6 7
            A G    ? ? ?    A T
         or C T          or C G
         or G A          or G C
         or T C          or T A

This process continues for a total of seven cycles, giving us additional partial information about 35 bases in the sequence.

How are the ambiguities resolved? By generating overlapping information. The newly synthesized strand is removed. A new primer is added, which binds at one position offset from the first. Now the first dinucleotide probe that binds has a blue tag (Figure 3.23)
[Figure 3.24 diagram (partially recovered): primer rounds 2–5, with universal sequencing primers labelled (n-2), (n-3), and (n-4); the per-position colour signals are not recoverable from the extracted text.]
Figure 3.24 In the Applied Biosystems Solid high-throughput sequencing platform, five rounds of primer reset are completed for
each primer. By reprocessing the same fragment with a shifted primer, overlapping signals provide enough information to resolve the
ambiguities arising from the fact that sets of four of the sixteen possible dinucleotide probes have the same fluorescent tag. Through
the sliding-primer process, almost every base is interrogated in two independent binding and ligation reactions by two different primers.
For example, the base at read position 20 is tested by primer n-1 in its fifth cycle, and by primer n-2 in its fourth cycle. This double
testing is not only necessary to resolve ambiguities in labelling, but provides a sensitive mechanism of error detection.
(Photo courtesy of Life Technologies.)
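The degenerate dinucleotide colour code, and the way a single known base resolves it, can be sketched in a few lines of Python. The yellow and red groupings below are those stated in the text (AG, CT, GA, TC and AT, CG, GC, TA). The blue and green groupings are an assumption taken from the commonly described SOLiD scheme, not from this section, and the XOR arithmetic is simply a compact way to express the table, not a description of the instrument software.

```python
BASES = "ACGT"                                  # A=0, C=1, G=2, T=3
COLOURS = ("blue", "green", "yellow", "red")    # blue/green assignment is assumed

def colour_of(dinucleotide: str) -> str:
    """Colour of a dinucleotide: the XOR of the two base codes indexes the colour."""
    a, b = (BASES.index(x) for x in dinucleotide)
    return COLOURS[a ^ b]

def encode(seq: str) -> list[str]:
    """Colours of all overlapping dinucleotides along a sequence."""
    return [colour_of(seq[i:i + 2]) for i in range(len(seq) - 1)]

def decode(first_base: str, colours: list[str]) -> str:
    """Recover the sequence from its colours, given one known starting base."""
    seq = [first_base]
    for c in colours:
        seq.append(BASES[BASES.index(seq[-1]) ^ COLOURS.index(c)])
    return "".join(seq)

# Each colour stands for four of the sixteen dinucleotides.
groups = {c: [x + y for x in BASES for y in BASES if colour_of(x + y) == c]
          for c in COLOURS}
print(groups["yellow"], groups["red"])

colours = encode("AGCTAGCT")
print(decode("A", colours))
```

Because every base after the first is inferred from its predecessor, one misread colour in a single pass would alter all downstream bases; interrogating each base in two independent reactions, as the caption describes, gives a check against this.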
was carried out before the newer equipment became available.)

The high-throughput sequencing platforms produced the amounts of raw data shown in Table 3.1 (see Exercise 3.20).

Sequencing of BACs produced an integrated physical and genetic map of the turkey genome. 725 contigs assembled from the BAC sequence data comprised an average of 76 clones. The average length of the contigs was 2300 kb.

The genome was assembled using a genetic map, maps based on BAC clones, and comparison with the chicken genome, which was already available. Only for the Z chromosome was the chicken genome relied on for assembly. It would be appropriate to regard the turkey genome as an independent, de novo, assembly.

After completion of sequence and annotation, a genome enters the databanks of molecular biology – ‘to take its place in society’.

The many databanks form interlocking networks. Release of a genome into any of the major archival projects is like casting a stone into a lake, sending
Databanks in molecular biology 105
ripples through the whole system. The genome itself is a nucleic acid sequence, but the protein-coding genes it contains will, after translation into amino acid sequences, contribute to protein sequence databanks.

Databases in molecular biology have grown with astonishing rapidity recently. Their development follows rules of their own. Here we can only describe general principles and guide the reader to his or her own exploration of this world. (See the Online Resource Centre associated with this book.)

The original databases were small, specialized, and – by post-web standards – isolated. It has always been true that different types of information need to be curated by people with the appropriate expertise. Specialists in different areas of biology organized the archiving of data related to their interests. One problem was that even where data overlapped – for instance, amino acid sequences are common to protein sequence and structure databases – there was relatively little effort to use controlled vocabularies and to make storage formats compatible. The International Scientific Unions and CODATA (the Committee on Data of the International Council of Scientific Unions) made important contributions, but problems remained.

With (1) the growing recognition of the importance of bioinformatics to research in biology, (2) the spectacular increases in the quantity of data, and – above all – (3) the emergence of the internet and World Wide Web, came pressure towards growth and integration of databanks. Requirements for large-scale funding, and the need for combining biological and computational expertise, led to the creation of national and international institutions responsible for archiving and curating the data.

The term ‘databank’ suggests a metaphor that perhaps is outdated: a bank as a safe place for something valuable, from which you can make a withdrawal and then go off shopping up and down the high street. This description emphasizes the archiving and curation activities of the databanks. Of course, these activities remain absolutely essential. However, the databanks also provide facilities – or at least links – for the computational analysis of the information recovered. (More like banks within large shopping malls.)

The realization that different data types were not intellectual islands, but that researchers needed coordinated access to them, led to the forging of links among the databases. The World Wide Web made this possible. The results were (1) systematization of formats and vocabularies, so that data in different collections became compatible; (2) pointers or links from each databank to others, facilitating access to related material; and (3) development of information-retrieval software that would streamline access to different databanks and smooth the passage from retrieving data to subjecting the results to computational analysis. Gathering all sequences in birds homologous to a given human protein is a problem in information retrieval; forming a multiple sequence alignment of these sequences is computational analysis. Smooth passage means a simple pipeline from the sequences returned by the database search into a multiple sequence alignment program.

In summary, the requirements of a major database project include the following:

1. Harvest the data plus annotations, curate them – that is, check both for accuracy and format – and distribute them.
2. Track and back up the data so that they do not get lost. Should any question arise, it should be possible to trace the data back to their origin and review all subsequent actions performed on them.
3. Provide links from the data to relevant items in other databanks, including bibliographical libraries such as PubMed.
4. Provide information retrieval and analysis software to support a research pipeline that includes both recovery of selected data and calculations with them.
5. Provide ample documentation and tutorial information, so that users can make effective use of the facilities.
6. Keep up with scientific advances in both biology and informatics. These may suggest improvements in the presentation and facilities.
7. Be responsive to users’ needs.

Primary data collections related to biological macromolecules include:

• nucleic acid sequences, including whole-genome projects;
• amino acid sequences of proteins;
database, the OMIM Morbid Map, deals with genetic diseases and their chromosomal locations. OMIA (Online Mendelian Inheritance in Animals) is a corresponding database for disease and other inherited traits in animals – excluding human and mouse.

Databases of structures

Structure databases archive, annotate, and distribute sets of atomic coordinates.

Approximately 80 000 protein structures are now known. Most were determined by X-ray crystallography or nuclear magnetic resonance (NMR). The Worldwide Protein Data Bank (wwPDB) now comprises four collaborating primary archival projects to integrate the archiving and distribution of experimentally determined biological macromolecular structures:

• The Research Collaboratory for Structural Bioinformatics (RCSB), in the USA
• The Protein Data Bank in Europe (PDBe), at EBI, UK
• The Protein Data Bank Japan (Osaka, Japan)
• The Biological Magnetic Resonance Data Bank (BMRB), in the USA.

The wwPDB sites accept depositions, process new entries, and maintain the archives.

These and many other web sites organize and provide access to these data, including but not limited to pictorial displays. Naturally, there is considerable overlap among them. Each has its own strengths, based in many cases on the research interests of the contributing scientists: the PDBe has recently embarked on an ambitious software development program for structural analysis. Many sites offer search facilities to identify structures of interest, based on the presence of keywords (or a logical combination of keywords), or numerical values such as the year of deposition. Different sites differ also in their ‘look and feel’, and users will discover their own preferences.

The wwPDB overlaps in scope with several other databases. The Cambridge Crystallographic Data Centre (CCDC) archives the structures of small molecules. This information is extremely useful in studies of conformations of the component units of biological macromolecules, and for investigations of macromolecule–ligand interactions, including but not limited to applications to drug design. The Nucleic Acid Structure Databank (NDB) at Rutgers University, New Brunswick, New Jersey, USA, complements the wwPDB.

Classifications of protein structures

Several web sites offer hierarchical classifications of the entire PDB according to the folding patterns of the proteins. These include:

• SCOP: structural classification of proteins
• CATH: class/architecture/topology/homologous superfamily
• DALI: based on extraction of similar structures from distance matrices
• CE: a database of structural alignments.

These sites are useful general entry points to protein structural data. For instance, SCOP offers facilities for searching on keywords to identify structures, navigation up and down the hierarchy, generation of pictures, access to the annotation records in the PDB entries, and links to related databases.

Specialized or ‘boutique’ databases

Many individuals or groups select, annotate, and recombine data focused on particular topics, and include links affording streamlined access to information about subjects of interest. For instance, the Protein Kinase Resource is a specialized compilation that includes sequences, structures, functional information, laboratory procedures, lists of interested scientists, tools for analysis, a bulletin board, and links. It has recently been redesigned, with a view to integrating expanded information content with a workbench equipped with embedded tools for launching analyses of the data within the user interface.

Expression and proteomics databases

Recall the central dogma: DNA makes RNA makes protein. Genomic databases contain DNA sequences.
Expression databases record measurements of mRNA levels, usually via ESTs (expressed sequence tags: short terminal sequences of cDNA synthesized from mRNA) describing patterns of gene transcription. Proteomics databases record measurements on proteins, describing patterns of gene translation.

Comparisons of expression patterns give clues to (1) the function and mechanism of action of gene products; (2) how organisms coordinate their control over metabolic processes in different conditions – for instance, yeast under aerobic or anaerobic conditions; (3) the variations in mobilization of genes in different tissues, or at different stages of the cell cycle, or of the development of an organism; (4) mechanisms of antibiotic resistance in bacteria and consequent suggestion of targets for drug development; (5) the response to challenge by a parasite; (6) the response to medications of different types and dosages, to guide effective therapy.

There are many databases of ESTs. In most, the entries contain fields indicating tissue of origin and/or subcellular location, stage of development, conditions of growth, and quantification of expression level. Within GenBank, the dbEST collection currently contains almost 70 million entries, from 2281 species. The species with the largest numbers of entries

Table 3.2 Species with the largest numbers of entries in dbEST

Species                                              Number of entries
Homo sapiens (human)                                 8 314 483
Mus musculus + domesticus (mouse)                    4 853 533
Zea mays (maize)                                     2 019 105
Sus scrofa (pig)                                     1 620 479
Bos taurus (cattle)                                  1 559 494
Arabidopsis thaliana (thale cress)                   1 529 700
Danio rerio (zebra fish)                             1 481 937
Glycine max (soybean)                                1 461 624
Xenopus (Silurana) tropicalis (western clawed frog)  1 271 375
Oryza sativa (rice)                                  1 251 304
Ciona intestinalis                                   1 205 674
Rattus norvegicus + sp. (rat)                        1 162 136
Triticum aestivum (wheat)                            1 071 367
Drosophila melanogaster (fruit fly)                  821 005
Xenopus laevis (African clawed frog)                 677 806
Oryzias latipes (Japanese medaka)                    666 891
Brassica napus (oilseed rape)                        643 874
Gallus gallus (chicken)                              600 423
Panicum virgatum (switchgrass)                       546 245
Hordeum vulgare + subsp. vulgare (barley)            501 620
Salmo salar (Atlantic salmon)                        498 212
Caenorhabditis elegans (nematode)                    393 714
● RECOMMENDED READING
Exercises
Exercise 3.1 Two loci with alternative alleles A/a and B/b, respectively, are 1 cM apart. A cross
between parents of genotype AB/AB and ab/ab produces a large number of offspring. Assuming
no selective difference between genotypes, estimate the fraction of the next generation that has
genotype Ab/aB.
Exercise 3.2 Gene A has two alleles, A1 and A2. Gene B has two alleles, B1 and B2. In a population,
the following haplotype frequencies are observed: A1B1 = 0.2, A2B2 = 0.45, A1B2 = 0.15, A2B1 = 0.2.
Calculate D, the extent of linkage disequilibrium.
Exercise 3.3 A fruitfly with a chromosome deletion shows pseudodominance for a trait. On a
photocopy of Figure 3.3, indicate with an ‘X’ where the locus for this trait might be.
Exercise 3.4 The Philadelphia translocation occurs in a bone marrow cell, resulting in the
development of chronic myeloid leukaemia. Would the patient transmit this leukaemia to his
or her offspring?
Exercise 3.5 On a photocopy of Figure 3.6, indicate two positions in the human chromosome
where one would look for genes that are linked in humans but unlinked in chimpanzees.
Exercise 3.6 (a) On a photocopy of Figure 3.8(a), indicate with an ‘A’ the band on the gel
that corresponds to the BamHI fragment in Figure 3.8(b) which begins at 1 kb and ends at 5 kb.
(b) On a photocopy of Figure 3.8(b), indicate with a ‘B’ the fragment that gives rise to the band
in the EcoRI lane of the gel which corresponds to the lowest molecular mass fragment.
Exercises, problems, and weblems 111
Exercise 3.7 The lengths of blocks that define human haplotypes vary, partly because of the
variation of recombination rates along the genome. Would you expect the haplotype blocks
to vary more if their sizes are measured in terms of number of base pairs or in terms of genetic
distance in centimorgans?
Exercise 3.8 Suppose that there are ten SNPs in a 10 kb region. If the region is on the Y
chromosome, how many possible haplotypes are there? If the region is on a diploid chromosome,
how many possible haplotypes are there?
Exercise 3.9 On a photocopy of Figure 3.9, indicate, by crossing atoms out and writing atoms in,
how the structure would have to be changed to illustrate an RNA molecule with the equivalent
base sequence.
Exercise 3.10 From a photocopy of Figure 3.11, cut out the individual bases and show that
guanine and thymine could form a non-canonical base pair containing two hydrogen bonds.
By comparing with a copy of the standard base pairs, show the extent to which a guanine–
thymine pair would not match the correct relative position and orientation of the sugars to be
stereochemically compatible with standard base pairs in a double helix of standard structure.
Uracil has the same hydrogen-bonding specificity as thymine. Guanine–uracil ‘wobble’ base
pairs are implicated in codon–anticodon interactions between tRNA and mRNA.
Exercise 3.11 The tetranucleotide illustrated in Figure 3.9 is self-complementary. (a) What does
this mean? (b) Make two photocopies of Figure 3.9. From one of them trim off the names of the
bases. From the other, cut out the individual bases and mount them adjacent to the first in position
to form Watson–Crick base pairs. Draw in the hydrogen bonds between bases. (It will not work
simply by turning one copy upside down and mounting it next to the other. In a double helix,
the two copies are symmetrically disposed in three dimensions but not in two dimensions.)
Exercise 3.12 If all of the DNA in all of the cells of your body were laid end to end, would you be
surprised if it were longer than the diameter of the solar system? Calculate the result and compare.
The semi-major axis of Pluto’s orbit is 5 906 376 272 km. The number of cells in an adult human
body has been estimated as $10^{13}$.
Exercise 3.13 In Figure 3.13, the region of the template strand not complexed with the primer is
shown as continuing the helical structure. Although justifiable pedagogically for clarity, why is this
not a structurally correct representation?
Exercise 3.14 On a photocopy of Figure 3.19, indicate the longest contig available from the data
given.
Exercise 3.15 In Figure 3.19, (a) What is the minimal coverage of any position (this is obvious)?
(b) What is the maximal coverage of any position (i.e. the largest number of fragments in which
the same position appears)? (c) Estimate the average coverage of the entire region. (Hint:
measure the total lengths of the fragments and divide by the length of the region.)
Exercise 3.16 The International Human Genome Mapping Consortium fingerprinted 300 000 BAC
clones. Assuming an average insert size of 150 kb and a 3.2 Gb genome size, what coverage
would be expected?
Exercise 3.17 One difficulty in extracting reads that correspond to mitochondrial DNA from
sequencing mixed fragments of nuclear and mitochondrial DNA is that the nuclear genome
contains segments homologous to regions of the mitochondrial genome, called numts.
Mammalian genomes contain 50–450 kb of numts. (The human genome contains 1005 such
segments, of average length 446 bp.) Estimate the fraction of reads from fragments of mammoth
DNA that are likely to be numts. The mammoth genome is 4.7 Gb long.
Exercise 3.18 Referring to Figure 3.24, by what primers, in which cycle, is the base at position
10 tested?
Exercise 3.19 Referring to Figure 3.23, suppose that the dinucleotide that bound to positions
23456789 gave a green fluorescence. What is the base at position 3?
Exercise 3.20 How much raw sequence data was generated for the turkey genome project?
How many human genome equivalents does this amount to?
Problems
Problem 3.1 Consider two linked traits in a population in which half of the individuals are
double heterozygotes with genotype AB/ab and the other half are double homozygotes (AB/AB).
Assuming no selective advantage of any combination of alleles for these traits and no preferential
mating, after recombination brings the population to equilibrium, what will be the ratio of AB/AB,
Ab/aB = aB/Ab and ab/ab individuals?
Problem 3.2 Consider two markers 1 cM apart. (a) What is the probability that there will be
recombination between them in one generation? (b) What is the probability that there will not be
recombination between them in one generation? (c) What is the formula for the probability that
there will not be recombination between them in n generations? (d) Evaluating this formula, what
is the probability that there will not be recombination between them in n = 10, 20, 30, 40, and
50 generations?
Problem 3.3 As a simplified but illustrative example of sequence assembly, we saw the first two
verses of Richard III chopped into overlapping 10-character fragments. (a) Chop these lines into
consecutive overlapping 5-character fragments, and scramble these fragments into random order.
Is it still possible to reconstruct the lines without ambiguity? Why is it more difficult to do so than
to reconstruct the lines from 10-character fragments? (b) Try generating, and then trying to
reassemble, 10- and 5-character fragments, presented in random order, of the lines of Polonius
(ignore punctuation marks):
… ‘tis true, ‘tis true ‘tis pity,
And pity ‘tis ‘tis true
Problem 3.4 Extend Problem 3.3 to simulate the effect of paired-end reads. Take any text of
about 100 words in length (a sonnet is about the right length) and write a program to create
fragments with a distribution of lengths distributed roughly normally around 30 ± 5 characters.
Print reads of 8 characters from each end. Tabulate the data from different fragments in random
order. Try to reassemble the text. Study how the difficulty of the assembly depends on the read
length, fragment length, and coverage.
Problem 3.5 Lander & Waterman* derived formulas for the expected completeness of an
assembly as a function of coverage (G = genome length, N = number of reads, L = read length,
c = NL/G = coverage):
probability that a base is not sequenced = $e^{-c}$
total expected gap length = $G \times e^{-c}$
total number of gaps = $N e^{-c}$
* Lander, E.S. & Waterman, M.S. (1988). Genomic mapping by fingerprinting random clones:
a mathematical analysis. Genomics 2, 231–239.
(a) What fraction of a genome could you expect to assemble from eightfold coverage? (b) What
total gap length would you expect in an assembly of a 2 Mb target genome size from eightfold
coverage? (c) How many gaps would you expect in an assembly of a 2 Mb target genome size
from an eightfold coverage of fragments with a read length of 500? (d) You want to sequence
a 4 Mb genome by the shotgun method, by assembling random fragments with read length 500.
What coverage would you require, to expect no more than four gaps, assuming no complications
arising from repetitive sequences or far-from-equimolar base composition?
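The Lander–Waterman expressions quoted in Problem 3.5 translate directly into code. The Python sketch below evaluates them for illustrative values (a 1 Mb genome, 10 000 reads of length 500, so fivefold coverage) that are deliberately different from those in the problem; the function name and the returned field names are hypothetical.

```python
import math

def lander_waterman(genome_len: float, n_reads: float, read_len: float) -> dict:
    """Expected assembly statistics from the formulas quoted in Problem 3.5."""
    c = n_reads * read_len / genome_len   # coverage c = NL/G
    p_unsequenced = math.exp(-c)          # probability that a base is not sequenced
    return {
        "coverage": c,
        "p_unsequenced": p_unsequenced,
        "gap_length": genome_len * p_unsequenced,  # G * e^(-c)
        "n_gaps": n_reads * p_unsequenced,         # N * e^(-c)
    }

# Illustrative values only: G = 1 Mb, N = 10 000 reads, L = 500.
stats = lander_waterman(1_000_000, 10_000, 500)
for name, value in stats.items():
    print(f"{name}: {value:.4g}")
```

The same function can be called with other values of G, N, and L to explore how quickly the expected gap length falls as coverage increases.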
Problem 3.6 Figure 3.25 shows a sequencing gel. (a) What is the sequence of this fragment?
(b) Can you see any self-complementary regions in this fragment that might form hairpin loops?
(c) On the basis of your answer to part (b), would you guess that this region encodes RNA
or protein?
Problem 3.7 Figure 3.26(a) shows a series of measurements from the Solid technology (see
p. 102). Figure 3.26(b) shows the colour coding of the dinucleotides. The known template
sequence implies that the first base is an A, therefore the dinucleotide at positions 0-1 is A-?
(a) What is the sequence of the fragment? (b) Suppose another fragment differs by a SNP at
position 15, and suppose that this SNP is a transition mutation (that is, a purine to the other
purine, or a pyrimidine to the other pyrimidine). How would a figure, corresponding to
Figure 3.26(a), that presents SOLID results from the mutated fragment, differ from
Figure 3.26(a)?
Figure 3.25 Autoradiograph of a sequencing gel (simulated). The shortest fragment travels the farthest. In this diagram, the direction of travel is down the page.
Problem 3.8 From Figure 3.18, (a) determine the rate of change of the cost of sequencing over
the years 2005–2007. (b) determine the rate of change of the cost of sequencing over the years
2008–2010. (c) According to these figures, what would be the cost in 2010 of determining a
human-sized genome at 10X coverage?
Problem 3.9 Many people have asked whether the author is related to Filippo Brunelleschi,
who was born in 1377, died in 1446 and is buried beneath the cathedral in Florence (the dome
of which he famously created). Assume that you were granted permission to exhume his body
and collect a tissue sample. (a) Estimate the number of generations between Brunelleschi and the
author. (b) Assuming that the author is a direct descendant of Brunelleschi, would you expect
to be able to prove it by DNA sequencing? Explain your answer.
Figure 3.26 (a) A series of measurements from the Solid technology: read positions 0–35, probed in five primer rounds with universal sequencing primers (n), (n-1), (n-2), (n-3), and (n-4). (b) The colour coding of the dinucleotides.
Problem 3.10 You are asked to devise a Master’s degree programme to train annotators for
databases in molecular biology. (a) What background would you require for entry to the
programme? (b) What courses would you require students to take during the programme?
Weblems
Weblem 3.1 Which of the following families have genes appearing in tandem arrays in the human
genome and which have genes dispersed among several chromosomes? (a) Actin, (b) tRNA,
(c) all globins, (d) HOX genes, (e) the major histocompatibility complex.
Weblem 3.2 What institution currently has the highest sequencing throughput power? How
many Gb per week can this institute produce?
Weblem 3.3 For the major companies providing sequencing equipment, what is the current
throughput rate and typical read length of their state-of-the-art instruments?
Weblem 3.4 On a photocopy of Figure 3.18, add points to bring the figure up-to-date.
CHAPTER 4
Comparative Genomics
LEARNING GOALS
• Knowing the three major divisions of living things – archaea, bacteria, and eukaryotes – based
on analysis of the sequences of 16S rRNA genes.
• Recognizing the prevalence of horizontal gene transfer, especially among prokaryotes, and
understanding that horizontal gene transfer is inconsistent with the hierarchical ‘tree of life’ picture
that the Linnaean classification scheme suggests.
• Being familiar with major events in the history of life.
• Appreciating the general distribution of genome sizes and numbers of genes.
• Distinguishing the characteristics of different types of genome organization in viruses,
prokaryotes, and eukaryotes.
• Recognizing the effects of gene duplication on genome evolution.
• Being able to distinguish the meanings of homologue, orthologue, and paralogue.
• Understanding the mechanism of genome change at the levels of individual bases, genes,
chromosome segments, and whole genomes.
• Understanding the limits of what genomes determine and what they do not determine, and the
limits of what we can currently explain on the basis of genetics and what we cannot.
• Appreciating, as far as possible, what makes us human.
• Understanding the idea of a model organism in the study of human disease.
• Appreciating the goals and plans of the Encyclopedia of DNA Elements (ENCODE) project, and
the related project, modENCODE.
Introduction
It is likely that life originated on Earth about 3.5 billion years ago. The first cellular life forms were undoubtedly prokaryotes. Eukaryotes appeared about 2 billion years later. There are enough residual similarities among living things to suggest a common ancestor of us all. The great diversity of living forms is, therefore, the result of divergence.
Sequence analysis gives the most unambiguous evidence for the relationships among species. For higher organisms, sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology usually give a consistent picture. Classification of microorganisms is more difficult, partly because it is less obvious how to select the features on which to classify them, and partly because a large amount of lateral gene transfer threatens to overturn the picture of the evolutionary tree entirely.
In this chapter, we discuss general approaches to comparative genomics of different species. From a human-centred view, we ask how to compare genomes in a way that illuminates our relationship with other species. In succeeding chapters, we extend this discussion to a more general view of the interaction and evolution of genomes.
Other aspects of comparative genomics lie outside the scope of this chapter: (a) Description of the variability of genomes within species. We have discussed the HapMap project, which is the comparative genomics of humans. (b) Cancer genomics has become a major thrust of research. Goals include both the study of cancer-related gene variations within populations that define risk factors (for instance, mutations in the tumour suppressor genes BRCA1 and BRCA2 enhance an individual’s likelihood of developing breast and ovarian cancer), and the study of genomic changes within single individuals as tumours develop and diverge.
Unity and diversity of life
The diversity of life fascinates everyone. The macroscopic life forms most familiar to us come in discrete types called species. Linnaeus, an 18th-century Swedish naturalist, first organized the characteristics of different species into a logical framework. He introduced the system of nomenclature still used today. (We take up biological systematics in more detail in Chapter 5.)
Linnaeus classified living things according to a hierarchy: kingdom, phylum, class, order, family, genus, and species. It usually suffices to specify the lowest two levels, as a binomial: genus and species. For instance, Homo sapiens = humans or Drosophila melanogaster = fruit fly (see Table 4.1). Each binomial uniquely identifies a species that may also be known by one or more common names; for instance, Bos taurus = cow. Conversely, many common names refer to whole groups of species. For example, there are many species of whales, not all in the same genus or even the same family. Of course, most species do not have common names at all.
Common names can cause, or result from, confusion. Often settlers on new continents applied common names to animals similar in appearance to familiar ones, but not in fact closely related to them. For instance, early European settlers in Australia called a native animal the koala bear, although marsupial koalas are not closely related to European bears.
Table 4.1 Classifications of humans and the fruit fly
           Human        Fruit fly
Kingdom    Animalia     Animalia
Phylum     Chordata     Arthropoda
Class      Mammalia     Insecta
Order      Primates     Diptera
Family     Hominidae    Drosophilidae
Genus      Homo         Drosophila
Species    sapiens      melanogaster
Traditional methods for bacterial classification were based on features of morphology (cell size and shape), biochemistry (uptake of stains, carbon and nitrogen sources, fermentation products), and physiology (growth temperature range and optimum, osmotic tolerance). Gram-positive and Gram-negative bacteria differ in their ability to take up crystal violet or methylene blue stain: Gram-positive bacteria contain a thick (20–80 nm) peptidoglycan layer in their cell wall that binds the stain. Immunological cross-reactivity has also been a basis for classification, especially among infectious species that elicit a clinical motivation for their classification. Before sequencing, hybridization of DNA from two different bacteria was a criterion for similarity. Most bacterial DNAs will form hybrid double-helical structures provided that the similarity in base sequence is >80%.
Later, following seminal work by C. Woese, prokaryotic species were defined in terms of variations in 16S ribosomal RNA (rRNA) and other sequences. Bacteria for which the 16S rRNA sequences are more than about 2.5–3% different are considered different species. Typically, this corresponds to no more than 70% similarity in overall genome sequence. If humans and chimpanzees were bacteria, we would easily be considered as the same species!
Taxonomy based on sequences
Protein, RNA, and DNA sequences have illuminated relationships between species, both for macroscopic organisms and microbes. The sequences have clarified some relationships but have exposed others as simplistic. Major results include the following:
Figure 4.1 Major divisions of the tree of life. Bacteria (blue) and archaea (magenta) are prokaryotes; their cells do not contain nuclei.
Bacteria include the typical microorganisms responsible for many infectious diseases and, of course, Escherichia coli, the mainstay of
molecular biology. Archaea include, but are not limited to, extreme thermophiles and halophiles, sulphate reducers, and methanogens.
We ourselves are eukarya – organisms containing cells with nuclei (green and red). Asterisks mark crucial splitting points (see Exercise 4.4).
This phylogenetic tree was derived by C. Woese from comparisons of ribosomal RNAs. These RNAs are present in all organisms, and
show the right degree of divergence. (Too much or too little divergence and relationships become invisible.) Figure 4.2 shows in more
detail the group that includes us – animals, fungi, and plants (red).
• All life on Earth has enough general similarity to show that all life forms had a common origin. Evidence includes the universality of the basic chemical structures of DNA, RNA, and proteins, the universality of their general biological roles, and the near-universality of the genetic code.
• On the basis of 16S rRNAs, C. Woese divided living things most fundamentally into three domains: bacteria, archaea, and eukarya. A domain occupies a level in the hierarchy above kingdom.
Figure 4.1 shows the major divisions of the tree of life. At the ends of the eukaryote branch are the metazoa, including yeast and all multicellular organisms – fungi, plants, and animals (see Figure 4.2). We and our closest relatives are in the vertebrate branch of the deuterostomes (see Figure 4.3).
Although archaea and bacteria are both unicellular organisms that lack a nucleus, at the molecular level archaea are in some ways more closely related to eukarya than to bacteria. It is also likely that the archaea are the closest living organisms to the root of the tree of life.
• Dating of historical events from sequence differences. As species diverge, their sequences diverge. L. Pauling and E. Zuckerkandl suggested that if sequence divergence occurred at a constant rate, it would provide a ‘molecular clock’ that would allow the dating of the splits in lineage between species. Although the clock is not universal, judicious calibration of rates of sequence change with palaeontological data permits dating of events in the history of life (see Box 4.2 and Figure 2.15).
Molecular approaches to phylogeny developed against a background of traditional taxonomy, based on a variety of morphological characters, embryology, geographical distribution and, for fossils, information about the geological context (stratigraphy). The classical methods have some advantages. Traditional taxonomists have much greater access to extinct organisms via the fossil record. They can date the appearance and extinction of species by geological methods (see Figure 2.15).
Molecular biologists, in contrast, have very limited access to extinct species. Some subfossil remains of species that became extinct as recently as within the last two centuries have legible DNA, including specimens of the quagga (a relative of the zebra), the thylacine (Tasmanian ‘wolf’, a marsupial), the mammoth from the permafrost in Russia, the dodo of Mauritius, the ‘elephant bird’ of Madagascar, and some New Zealand birds, for instance, moas. It has been possible to sequence mitochondrial DNA from ∼10 000-year-old remains of an ‘Irish elk’. DNA sequences from Neanderthal man have been recovered from individuals who died approximately 40 000 years ago. But Jurassic Park remains fiction!
A crucial event in the acceptance of molecular methods occurred in 1967 when V.M. Sarich and A.C. Wilson dated the time of divergence of humans from chimpanzees at 5 million years ago, based on immunological data. At that time, traditional palaeontologists dated this split at 15 million years ago and were reluctant to accept the molecular approach. Reinterpretation of the fossil record led to acceptance of a more recent split and broke the barrier to general acceptance of molecular methods. It is now generally accepted that human and chimpanzee lineages diverged between ∼6 and 8 million years ago.
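The arithmetic behind a calibrated molecular clock can be sketched in a few lines. This is a minimal illustration, assuming a strict clock in which both lineages accumulate changes at the same constant rate r, so that a pairwise distance d after divergence time t is d = 2rt. The function names and all numerical values are hypothetical, and the sketch uses sequence distances rather than the immunological distances of Sarich and Wilson.

```python
def calibrate_rate(distance, known_divergence_time):
    """Rate per lineage, from a pair of species with a fossil-dated split.

    Assumes a strict clock: distance = 2 * rate * time.
    """
    return distance / (2.0 * known_divergence_time)

def estimate_divergence_time(distance, rate):
    """Divergence time implied by a pairwise distance under the same clock."""
    return distance / (2.0 * rate)

if __name__ == "__main__":
    # Hypothetical calibration pair: distance 0.10 substitutions per site,
    # split dated from fossils at 50 million years ago.
    rate = calibrate_rate(0.10, 50.0)           # per site per million years
    # Hypothetical query pair with distance 0.012 substitutions per site.
    print(estimate_divergence_time(0.012, rate), "million years")
```

Because the clock is not universal, an estimate of this kind is only as good as the calibration and the assumption that the rate is shared by the lineages being compared.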
Figure 4.2 Phylogenetic tree of metazoa (multicellular […] (red) are two major lineages that separated at an early stage of evolution, estimated at 670 million years ago. They show very different patterns of embryological development, including different early cleavage patterns, opposite orientations of the mature gut with respect to the earliest invagination of the blastula, and the origin of the skeleton from mesoderm (deuterostomes) or ectoderm (protostomes). Taxa shown: Deuterostomes: Vertebrata (human), Cephalochordata (lancelets), Urochordata (sea squirts), Hemichordata (acorn worms), Echinodermata (starfish, sea urchins). Protostomes (Lophotrochozoa and Ecdysozoa): Bryozoa, Entoprocta, Platyhelminthes (flatworms), Pogonophora (tube worms), Nemertea (ribbon worms), Annelida (segmented worms), Echiura, Mollusca (snails, clams, squids), Sipunculan (peanut worms), Gnathostomulida, Rotifera, Gastrotricha, Nematoda (roundworms).
Figure 4.3 Phylogenetic tree of vertebrates and our closest relatives. Chordates, including vertebrates, and echinoderms are all deuterostomes. Examples of each are shown in blue. Groups shown: Echinoderms (starfish), Cephalochordates (amphioxus), Amphibians (frog), Mammals (human), Reptiles (lizard), Birds (chicken).
• The importance of horizontal gene transfer. This is the acquisition of genetic material by one organism from another by natural rather than laboratory procedures through some means other than descent from a parent during replication or mating (see Box 4.3). Several mechanisms of horizontal gene transfer are known, including direct uptake, as in Griffith’s pneumococcal transformation experiments, or via a viral carrier. Arrangement of species into phylogenetic trees, in contrast, assumes strict ancestor–descendant relationships between different organisms during evolution.
Horizontal gene transfer among different species has affected most genes in prokaryotes. It requires a change in our thinking from ordinary ‘clonal’ or parental models of heredity. Microorganisms do not easily fit into the structure of the ‘tree’ of life but require a more complex organizational chart.
On learning that Streptomyces griseus trypsin is more closely related to bovine trypsin than to other microbial proteinases, Brian Hartley commented in 1970 that ‘. . . the bacterium must have been infected by a cow’. This was a clear example of lateral or horizontal gene transfer – a bacterium picking up a gene from the soil in which it was growing, that an organism of another species had deposited there. The classic experiments on pneumococcal transformation by Griffith, and those by O. Avery, C. MacLeod, and M. McCarty that identified DNA as the genetic material, are another example.
Evidence for horizontal transfer includes (1) discrepancies among evolutionary trees constructed from different genes; and (2) direct sequence comparisons between genes from different species.
• In Escherichia coli, about 25% of the genes appear to have been acquired by transfer from other species.
• In microbial evolution, horizontal gene transfer is more prevalent among operational genes – those responsible for ‘housekeeping’ activities such as biosynthesis – than among informational genes – those responsible for organizational activities such as transcription and translation.
For example:
– Bradyrhizobium japonicum, a nitrogen-fixing bacterium, symbiotic with higher plants, has two glutamine synthetase genes: one is similar to those of its bacterial relatives; the other is 50% identical to those of higher plants;
– rubisco (ribulose-1,5-bisphosphate carboxylase/oxygenase), the enzyme that first fixes carbon dioxide at entry to the Calvin cycle of photosynthesis, has been passed around between bacteria, mitochondria, and algal plastids, as well as undergoing gene duplication;
– many phage genes appearing in the E. coli genome provide further examples and point to a mechanism of transfer.
Nor is the phenomenon of horizontal gene transfer limited to prokaryotes. Both eukaryotes and prokaryotes are chimaeras. Eukaryotes derive their informational genes primarily from an organism related to Methanococcus, and their operational genes primarily from proteobacteria, with some contributions from cyanobacteria and methanogens. Almost all informational genes from Methanococcus itself are similar to those in yeast. At least eight human genes appeared in the Mycobacterium tuberculosis genome. S. griseus trypsin is an example of eukaryote → prokaryote transfer.
The observations hint at the model of a ‘global organism’, or a genomic World Wide DNA Web from which organisms download genes at will! How can this be reconciled with the fact that the discreteness of species has been maintained? We offered the conventional explanation, that the living world contains ecological ‘niches’ to which individual species are adapted: the discreteness of niches explains the discreteness of species. But this explanation depends on the stability of normal heredity to maintain the fitness of the species. Why would the global organism not break down the lines of demarcation between species, just as global access to pop culture threatens to break down lines of demarcation among national and ethnic cultural heritages? Perhaps the answer is that it is the informational genes, which appear to be less subject to horizontal transfer, that determine the identity of the species.
Sizes and organization of genomes
There is a general correlation between complexity of organism and amount of DNA per cell. Prokaryotes have less DNA per cell than eukaryotes, and yeast has less than mammals. However, although humans have more DNA per cell than certain other organisms popular in molecular biology, including Caenorhabditis elegans and the fruit fly, many organisms have even greater amounts than we do. The genome of Amoeba dubia is 200 times larger than the human genome. The genome of the marbled lungfish (Protopterus aethiopicus), a closer relative, is 43 times as large as ours.
Why the different amounts of DNA? As far as we know, most of the human genome does not encode protein or RNA. Regions of genomes without known function are often referred to as ‘junk DNA’. Of course, the fact that we may not know the function of much of our genome does not mean that it has none. (Maybe it is junk, but it is certainly not all transcriptionally inert. A series of recent discoveries has revealed many new types of RNA molecules, mostly involved in control processes. It would be naïve to doubt that many more types will come to light.) Moreover, the amount of space between genes affects the rate of crossing over and recombination and, thereby, rates of evolution. Indeed, the large amount of repetitive sequence between our genes enhances recombination rates by promoting homologous recombination. Rate of evolutionary change is a characteristic of a species that is certainly subject to selective pressure. Features of the genome that affect rate of evolution cannot be dismissed entirely as junk.
If genome size per se does not single out humans, what about numbers of genes? Again there is a general correlation between complexity of organism and estimated numbers of genes. Viral genomes encode only a few proteins. Prokaryote genomes contain hundreds or thousands of genes. The simple eukaryote yeast has almost 6000 genes, fewer than twice as many as E. coli. Metazoa have tens of thousands of genes.
However, within groups of related organisms, including vertebrates, there is no simple correlation between apparent complexity of organism, or even genome size, and numbers of genes (see Table 4.3). Two vertebrates, the puffer fish and humans, appear to have roughly the same number of genes but differ by almost an order of magnitude in genome size. It was also unexpected to find that the worm C. elegans appears to have more genes than the fruit fly.
• Even taking alternative splicing and RNA editing into account, these figures give only a static idea of proteome complexity. Cells control gene expression patterns by complex and dynamic regulatory networks. Conclusion: it is difficult to correlate numbers of expressed genes with organismal complexity if one has no good way of measuring either.
The phenomena of alternative splicing and RNA editing show the situation to be more complicated than simple gene estimates make it appear. This is one reason why it has been difficult to get an accurate count of the number of genes in humans and other higher organisms. In eukaryotes, estimates of gene number refer to maximal sets of exons in units that are coordinately transcribed and translated. In fact, variation in splicing may create many proteins from each gene. As an extreme example, in the mammalian immune system, billions of distinct antibodies arise from regions in the genome containing fewer than ∼100 exons. (The immune system is special: splicing occurs at the DNA, not the RNA, level.)
RNA editing is the alteration of bases in mRNA, after transcription. The changes are usually either C→U or A→I (I = inosine, which has the coding properties of G). If only some mRNA from the same gene is edited, an extra degree of variability in the proteins arises. Investigation of RNA editing is a relatively new field, and many more implications of the process in health and disease remain to be revealed. However, it is known that defective RNA editing contributes to the pathology of sporadic amyotrophic lateral sclerosis (a neurodegenerative disease, of which the most famous sufferers have been Lou Gehrig and Stephen Hawking).
The conclusion is that it is very difficult to estimate the size – to say nothing of the complexity – of a eukaryote’s proteome from its genome.
The basis of the complexity of expression patterns, metabolic activity, and indeed all other phenotypic features is the organization of the genome itself. Different types of organism have experimented with different solutions of the problems of packaging long, narrow strands of DNA and of controlling access of transcriptional machinery to different regions.
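The effect of the two kinds of RNA editing described above on the encoded protein can be illustrated with a toy translation. This is a sketch under stated assumptions: the mRNA sequence is invented, the function names are hypothetical, the standard genetic code is used, and inosine is simply read as G, as the text indicates.

```python
# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def edit(mrna, position, new_base):
    """Apply one C->U or A->I edit at a 0-based position (hypothetical helper)."""
    if (mrna[position], new_base) not in {("C", "U"), ("A", "I")}:
        raise ValueError("only C->U and A->I edits are modelled here")
    return mrna[:position] + new_base + mrna[position + 1:]

def translate(mrna):
    """Translate from the first base, stopping at a stop codon; I is read as G."""
    decoded = mrna.upper().replace("I", "G")
    protein = []
    for pos in range(0, len(decoded) - 2, 3):
        residue = CODON_TABLE[decoded[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

if __name__ == "__main__":
    mrna = "AUGCAGCAAUUU"                    # invented example: Met-Gln-Gln-Phe
    print(translate(mrna))
    print(translate(edit(mrna, 4, "I")))     # CAG -> CIG, read as CGG (Arg)
    print(translate(edit(mrna, 6, "U")))     # CAA -> UAA, a stop codon
```

A single edited base can thus change one amino acid or truncate the protein, which is why a mixture of edited and unedited transcripts from the same gene adds to proteome variability.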
Viral genomes
As assembled within the virion, a viral genome may consist of:
Nucleic acid              Examples
• Single-stranded DNA     Bacteriophages φX174 and M13
• Double-stranded DNA     Adenoviruses, smallpox virus, Epstein–Barr virus, bacteriophage λ
• Single-stranded RNA     Bacteriophages MS2, Qβ, tobacco mosaic virus, HIV-1
• Double-stranded RNA     Bluetongue virus
Single-stranded DNA viral genomes are generally converted to double-stranded DNA by the host. Replication of RNA viruses is prone to mutation because the error-correction mechanisms active in host DNA replication do not apply. This helps viruses to evade host immune systems and facilitates their jumping between host species. (Emerging viral diseases, including but not limited to acquired immunodeficiency syndrome (AIDS) and avian flu, are usually based on viruses with RNA genomes.)
A viral genome consisting of single-stranded RNA can be:
(+)sense = same sequence as protein-translatable mRNA
(−)sense = complementary sequence to mRNA
ambisense = mixture of both.
Inside the cell, (+)sense viral RNAs present themselves as messenger RNA (mRNA) and are translated. (−)Sense viral RNAs and double-stranded viral RNAs require specialized polymerases for conversion to mRNA. Retroviral genomes contain (+)sense RNA, which is reverse transcribed into host DNA. These viral polymerases and reverse transcriptases are proteins that are contained in the infecting virion and enter the host cell along with the viral nucleic acid.
Some viral genomes are infectious on their own. For some RNA viruses, a DNA reverse transcript of the viral RNA is infectious (although at a lower rate than the natural virion). This permits preparation of large quantities of viral genomes for vaccines, avoiding the lability and high mutation rate of viral RNA replication.
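The relation between (+)sense and (−)sense genomes and the mRNA can be made concrete with a short sketch. The function names and the nine-base example are hypothetical; both strands are assumed to be written 5′→3′, so the mRNA corresponding to a (−)sense genome is its reverse complement.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA sequence written 5'->3'."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna.upper()))

def mrna_from_genome(genome_rna, sense):
    """mRNA sequence implied by a single-stranded RNA genome.

    sense is "+" (genome has the mRNA sequence) or "-" (genome is
    complementary to the mRNA). Ambisense genomes are not modelled.
    """
    if sense == "+":
        return genome_rna.upper()
    if sense == "-":
        return reverse_complement(genome_rna)
    raise ValueError("sense must be '+' or '-'")

if __name__ == "__main__":
    # Invented (-)sense fragment; its complement begins with the codon AUG.
    print(mrna_from_genome("AAACUGCAU", "-"))
```

In the cell this conversion is carried out by the specialized viral polymerases mentioned above, which is why (−)sense genomes cannot be translated directly.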
These properties can change as the virus evolves. A strain that infects animals can potentially become infectious to humans. Species range depends on the different forms of sialic acid presented on viral glycoproteins. An important determinant is haemagglutinin residue 226, which is Gln in viruses infectious to birds and Leu in viruses infectious to humans.
Avian flu
In 2006, an H5N1 strain of avian flu characterized by very high mortality infected domestic poultry in several countries. It is considered a particularly dangerous threat to humans because it has a high mutation rate and recombines readily. One way for the virus to jump from birds to humans is for two strains to co-infect pigs and use them as a ‘mixing vessel’ for recombination.
Avian flu can normally infect only birds and, in some cases, pigs. Domestic poultry stocks raised under conditions of very high population density are particularly vulnerable. Often migratory birds are carriers but do not get sick. (The 2006 avian flu strain was spread from southeast Asia to Russia by migratory birds.)
The H5N1 strain prevalent in 2006 was first identified in Hong Kong in 1997 and traced to ducks from Guangdong province. It jumped the species barrier to mammals, becoming infectious to pigs, in April 2004. It then became supervirulent, killing rodents, birds, and humans. This H5N1 strain is fatal in 100% of domesticated chickens and in 54% of reported human cases. Human to human transmission is uncommon.
Compared with previous epidemics, the world today is particularly vulnerable because of:
• increased human population densities;
• widespread long-distance travel;
• intensive livestock production (including antibiotic feeding, which may create drug-resistant strains of infectious bacteria); and
• policies of various governments that do not adequately reimburse farmers who must sacrifice animals, creating a disincentive to report disease; the result is a delay or even default of an effective response.
Increased population densities of both humans and animals threaten a greater rate of spread of a dangerous strain of virus, even in comparison with the recent 1968–1969 epidemic (see Table 4.4).
Table 4.4 Population in China (millions)
Year    Humans    Pigs    Poultry
1968    790       5.2     12.3
2005    1300      508     13 000
Aggressive approaches to controlling avian flu have involved large-scale culling of stocks. In 1997, the H5N1 strain infected poultry in Hong Kong and caused six human fatalities. The entire poultry population of the island had to be destroyed: 1.5 million birds in three days. An H7N7 epidemic in the Netherlands in 2003 led to the killing of >30 million birds (approximately twice the human population of the country). In Asia in 2004, over 100 million birds were culled.
Drugs against influenza
Tamiflu (oseltamivir) and Relenza (zanamivir) are the two major drugs against influenza. Both are inhibitors of the viral neuraminidase.
Relenza (Figure 4.7) was designed at the Commonwealth Scientific and Industrial Research Organization (CSIRO) laboratory in Melbourne, Australia. Crystal structures of influenza neuraminidase showed that conserved sequences formed a cavity, suggesting a target site for drugs. Because the drug targets the active site of the enzyme, it is harder for the virus to evolve resistance.
Figure 4.7 (a) The structure of the anti-influenza drug zanamivir (Relenza). (b) Zanamivir is a transition-state analogue that binds to the
active site of influenza neuraminidase. Here the atoms of the drug are shown as large spheres and the residues from the neuraminidase
are shown in ball-and-stick representation in stereo.
Ethical dilemma: publication of RNA sequence of the virulent 1918–1919 strain of influenza virus
Recently, scientists were able to recover and sequence the strain of influenza active in the 1918–1919 pandemic. The journal Science published the work and, consistent with editorial policy, required that the sequence be deposited in the nucleic acid databanks.
In The New York Times on 17 October 2005, R. Kurzweil and W. Joy wrote an article critical of the decision to make the sequence generally available in databanks on the grounds that terrorists might use the information to recreate the virus and use it as a weapon.
Reactions to the publication of the reconstructed pandemic viral sequence illustrate the conflict between the recognition that free and open access to information is a benefit to the progress of science, and the dangers of its misuse. A precedent occurred before the Second World War when physicist Leo Szilard tried, unsuccessfully, to persuade colleagues not to publish results that might prove useful in the development of atomic weapons. He suggested that journals record dates of receipt and acceptance of manuscripts but then sequester the articles for the duration. This occurred well before the strict secrecy imposed after the Manhattan Project was organized.
Science did make an exception to its mandatory-deposition policy in publishing the Human Genome Draft Sequence by J.C. Venter and co-workers in 2001.
Genome organization in prokaryotes
of double-stranded DNA 4 639 675 bp long, closed into a circle. The DNA is supercoiled and associated with histone-like proteins into a ‘chromosome’, appearing in a subcellular structure called the nucleoid. Some E. coli cells may contain plasmids: short,
Figure 4.9 Map of the genome of E. coli K12. (a) Full view. Red arrows show protein-coding regions of the forward strand. Blue arrows
show protein-coding regions of the reverse strand. Pink arrows show structural RNA-encoding regions of the forward strand. Cyan
arrows show structural RNA-encoding regions of the reverse strand. Radial ticks identify individual gene products, colour coded
according to function. COG categories refer to the Clusters of Orthologous Groups database (http://www.ncbi.nlm.nih.gov/COG/).
(b) Expanded view of the region containing the his operon. The BacMap site provides access to genomes of bacteria and archaea
(http://wishart.biology.ualberta.ca/BacMap/).
Pictures reproduced, by permission, from BacMap: An Interactive Atlas for Exploring Bacterial Genomes. See: Stothard, P., Van Domselaar, G.,
Shrivastava, S., Guo, A., O’Neill, B., Cruz, J., Ellison, M., & Wishart, D.S. (2005). BacMap: an interactive picture atlas of annotated bacterial genomes.
Nucl. Acids Res. 33, D317–D320.
co-transcribed genes have related functions, forming an ‘operon’.
Timing illuminates the interrelationships among these processes. Under ordinary conditions, it takes E. coli 40 minutes to replicate its genome. The full generation time between cell divisions is about an hour. This explains why genes that require high rates of expression tend to be near the origin of replication: the availability for transcription of partially replicated DNA in effect increases the copy number of such genes. Conversely, the half-life of mRNA is only a few minutes. Therefore, translation must overlap transcription.

Gene transfer

There are three methods of transfer of DNA between prokaryotic cells.
• Transformation. The uptake of ‘naked’ DNA, as in the experiments of Griffith and of Avery, MacLeod, and McCarty.
• Conjugation. Insertion of some or all of the DNA from one cell into another – the prokaryotic equivalent of ‘mating’, although there is no meiosis or zygote formation. Bacterial conjugation does permit formation of recombinants. The start point for DNA transfer varies with the position in the genome of a mobile site. This is not the same as the origin of replication, oriC.
• Transduction. Transfer of DNA from one cell to another via a bacteriophage. During replication in one cell, a phage can pick up fragments of bacterial DNA and transmit them to another cell subsequently infected by progeny virions.

Bacterial conjugation has proved very useful in genome mapping. Transfer of a complete genome takes 100 minutes. Interrupting the process at different times (by physical agitation) results in partial genome transfer. Identifying which genes have entered the recipient cell after different intervals revealed the order of the genes. Positions in the genetic map of E. coli, for example, were classically expressed in minutes. Now, of course, they are specified in terms of the DNA sequence itself.

• Prokaryotes have several mechanisms for sharing genetic material: transformation by naked DNA, conjugation, and transfer via viruses.
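The interrupted-mating argument amounts to a linear conversion between transfer time and genome position. The following is a minimal sketch, assuming a constant transfer rate (an idealization) and using the genome length of 4 639 675 bp and the 100-minute transfer time quoted in this chapter. The function names and the gene names are hypothetical, and the sketch ignores the start point and direction of transfer, which in practice vary with the position of the mobile site.

```python
GENOME_BP = 4_639_675       # E. coli K12 genome length quoted in the text
FULL_TRANSFER_MIN = 100.0   # time to transfer a complete genome (from the text)


def minutes_to_bp(minutes: float) -> int:
    """Approximate number of base pairs transferred after `minutes`,
    assuming a constant transfer rate (an idealization)."""
    if not 0 <= minutes <= FULL_TRANSFER_MIN:
        raise ValueError("minutes must lie between 0 and 100")
    return round(minutes / FULL_TRANSFER_MIN * GENOME_BP)


def order_genes_by_entry(entry_times: dict[str, float]) -> list[str]:
    """Order genes by the time at which they first appear in recipient cells."""
    return sorted(entry_times, key=entry_times.get)


if __name__ == "__main__":
    # One 'minute' of the classical map corresponds to roughly 46 kb.
    print(minutes_to_bp(1.0))
    # Hypothetical entry times (minutes) from an interrupted-mating experiment.
    hypothetical = {"geneA": 8.0, "geneC": 25.0, "geneB": 17.0}
    print(order_genes_by_entry(hypothetical))
```

Sorting genes by their first time of entry is all that is needed to recover gene order; the conversion to base pairs is only as good as the constant-rate assumption.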
132 4 Comparative Genomics
Genomic information in eukaryotic cells is divided between the main nuclear genome and cytoplasmic organelles: mitochondria and chloroplasts.
In the nucleus, DNA is complexed with proteins to form chromosomes. We have already noted that chromatin remodelling is an important component of regulation of gene expression. DNA in organelles also forms nucleoprotein complexes. These resemble bacterial nucleoids, reflecting the endosymbiont origin of organelles. Organelle genomes are circular, double-stranded DNA molecules. Organelles in some species contain more than one DNA molecule.
With a few exceptions, the amount of nuclear DNA per cell is constant in all cells of an organism except for gametes. Organelles vary in number of copies of the DNA that they contain. Moreover, cells of different tissues contain different numbers of organelles. Mitochondria are more numerous in cells that consume large amounts of energy, such as brain, heart, and eye (about 10 000 mitochondria per cell), than in skin cells (only a few hundred). In plants, a leaf cell may contain up to 100 chloroplasts, the number varying among species. Leaves may contain 10⁶ chloroplasts per mm² of surface area. Unsurprisingly, root cells have none.
Mitochondrial genomes vary in size among species. Human mitochondrial DNA is 16 569 bp long. Yeast mitochondrial DNA is 75 kb and that of plants is considerably larger: the DNA of muskmelon (cantaloupe) mitochondria is 2.4 Mb! Chloroplast genomes range from about 110 to 160 kb, larger than animal mitochondrial DNAs. In some species, such as the protozoan Cryptosporidium, the mitochondria contain no DNA at all!
Mitochondria and chloroplasts carry out their own protein synthesis. Chloroplasts and plant mitochondria translate their genes according to the standard genetic code, but animal mitochondria use variants.
There is active traffic between organelle and nuclear genomes (see Box 4.5). Approximately 90% of chloroplast proteins are encoded by nuclear genes and gene transfer is still going on. However, the differences in genetic code inhibit mitochondrial → nuclear transfer in animals (see Table 4.6). In the Arabidopsis thaliana genome, a 620 kb insertion of mitochondrial DNA in nuclear chromosome 2 contains much of the mitochondrial genome including some duplicated material. (The full mitochondrial genome is only 366 924 bp long.)

Table 4.6 Numbers of RNAs and proteins encoded in organelle genomes

Organelle             RNA encoded                                   Proteins encoded
Animal mitochondria   Two ribosomal RNAs, 22 tRNAs                  12 or 13
Plant mitochondria    Three ribosomal RNAs, ∼22 tRNAs               30–39
Chloroplast           Four ribosomal RNAs (two copies), 37 tRNAs    50–57

Figure 4.10 A lettuce sea slug (Elysia crispata) on a patch of the alga Bryopsis. The slug eats algae, extracts and endocytoses the chloroplasts, and then basks in the sun, as in the picture, while the chloroplasts photosynthesize organic compounds.
Photograph by William Capman, Augsburg College, Minneapolis, MN, USA.

Photosynthetic sea slugs: endosymbiosis of chloroplasts

The endosymbiotic origin of mitochondria and plant chloroplasts is a well-accepted theory about events that happened 1–2 billion years ago. Acquisition of endosymbiotic chloroplasts by sea slugs is observable today (see Figure 4.10). The slugs, which are molluscs – i.e. animals – eat algae. They open the algal cells and discard the contents – including the
How genomes differ 133
nucleus – except for the chloroplasts. The chloroplasts are taken up into host cells, where they carry out photosynthesis. The slug can live for months on the molecules synthesized using solar energy.
The mollusc does not get an entirely ‘free lunch’. In algae typical of the slug’s food, the chloroplast genome encodes only 13% of the organelle proteins. During the active life of the chloroplast within the animal’s cells, its proteins turn over and must be synthesized. Genes from the algal chloroplast have entered the mollusc nuclear genome and are expressed by the host.
Rps14, a protein from the small subunit of the rice mitochondrial ribosome, is encoded by a nuclear gene, rps14, on chromosome 8. In fact, by alternative splicing, the five exons of this region (centre strip) encode both Rps14 (ribosomal protein 14 of the small subunit) and SdhB (the B subunit of succinate dehydrogenase).

(Diagram: genomic DNA with exons 1–5, shown as the centre strip between the rps14 and sdhB splice products; labelled lengths 740 bp and 1142 bp.)

Both Rps14 and SdhB are synthesized in the cytoplasm and transported into the mitochondria. It is likely that both genes were originally in the mitochondrial genome. A region similar to the nuclear gene rps14 remains in the rice mitochondrial genome. This mitochondrial gene is translated, but the product has become non-functional as a result of four single-nucleotide deletions that destroy the reading frame. Certain other higher plants have functional mitochondrial rps14 genes (broadbean, rapeseed), whereas others resemble rice in containing non-functional mitochondrial rps14 genes, but functional nuclear genes (potato, Arabidopsis).
Genes similar to sdhB have not been observed in plant mitochondrial genomes. It is likely, therefore, that the move of the sdhB gene from the mitochondrial to the nuclear genome is an old event and the move of the rps14 gene is a relatively recent one.
Moving a mitochondrial gene to the nucleus moves the site of its expression to the cytoplasm. It needs a leader sequence containing a proper targeting signal to direct the protein to mitochondria. The protein encoded by the rice nuclear gene for mitochondrial rps14 appears to have borrowed a mitochondrial targeting signal from sdhB, part of an earlier generation of immigrants, by alternative splicing. Compared with products of mitochondrial genes for Rps14, the nuclear-encoded version has an N-terminal extension derived from the sdhB exons. This extension is cleaved off in the mitochondria.
There is a growing consensus that the dynamics of expression patterns embodies the most interesting features of genomes. It is nevertheless prudent to begin less ambitiously, with static aspects – the sequences themselves. Similarities and differences among genome sequences appear (1) at the levels of individual bases; (2) at the level of genes (see Box 2.6); (3) in larger-scale blocks; and (4) at the level of whole genomes that have undergone complete duplications.

Variation at the level of individual nucleotides

Closely related genomes tend to contain regions encoding closely related proteins. Alignments of the sequences of homologous genes reveal differences,
                               10        20        30        40        50        60
                                |         |         |         |         |         |
Human                      VKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMIKPFFHSLSEKYSN___VIF
Chicken                    VKSVGNLADFEAELKAAGEKLVVVDFSATWCGPCKMIKPFFHSLCDKFGD___VVF
Neurospora crassa      MSDGVKHINSAQEFANLLNTT__QYVVADFYADWCGPCKAIAPMYAQFAKTFSIPNFLAF
Staphylococcus aureus     MAIVKVTDADFDSKVESG___VQLVDFWATWCGPCKMIAPVLEELAADYEG__KADI
                           vk       F   l       vvvDF AtWCGPCKmI P    l           f
Figure 4.11 Thioredoxins are proteins that catalyse disulphide-exchange reactions, contributing to the speed and accuracy of the
protein-folding process. The human thioredoxin gene extends over 13 kb and consists of five exons. This figure shows the alignment
of amino acid sequences of thioredoxins from two vertebrates (human and chicken), a fungus (Neurospora crassa), and a bacterium
(Staphylococcus aureus). Colour coding: green, amino acids with medium-sized and large hydrophobic side chains; yellow, small side
chains; magenta, polar side chains; blue, positively charged side chains; red, negatively charged side chains. Upper-case letters in black
on the line below the sequences indicate amino acids conserved in all four sequences. Lower-case letters in the line below the sequences
indicate amino acids conserved in three of the four sequences.
mostly in the form of single-site mutations or insertions and deletions. Typically, there is reasonable correlation between overall species divergence and divergence of sequences of individual genes and the corresponding proteins. Comparisons of amino acid and gene sequences of thioredoxins provide a typical example (see Figure 4.11).
Compare these protein sequences with the corresponding gene sequences from human, chicken, and Staphylococcus aureus (see Figure 4.12). Note the large gap in the bacterial gene corresponding to the intron in the human and chicken genes. The asterisks under the sequences indicate positions containing the same base in all three genomes. The colons indicate positions containing two identical bases among the three; in most cases, two common bases appear in the human and chicken sequences, even in the non-coding regions. Note the frequent occurrence of patterns ‘**:’ and ‘**-blank’. What is the likely reason for this?

• In most cases, divergence of the sequences of genes and proteins correlates well with the divergence of the species.

an important mechanism of evolution. They are a prolific source of variation, the raw material of both selection and genetic drift.

BOX 4.6 What can happen to a gene?

During evolution:
1. A gene may pass to descendants, accumulating favourable (or unfavourable) mutations or drifting neutrally.
2. A gene may be lost.
3. A gene may be duplicated, followed by divergence or by loss of one of the pair.
4. A gene may undergo horizontal transfer to an organism of another species.
5. A gene may undergo complex patterns of fusion, fission, or rearrangement, perhaps involving regions encoding individual protein domains.
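The correlation between species divergence and sequence divergence can be checked roughly on the thioredoxin alignment of Figure 4.11. The sketch below computes pairwise fractional identity over ungapped columns. The sequences are transcribed from the figure, with ‘-’ standing for both gaps and unaligned ends; the gap placement is an assumption read from the extracted figure, and the function name is illustrative only.

```python
from itertools import combinations

# Aligned thioredoxin sequences transcribed from Figure 4.11.
# '-' marks a gap or an unaligned end (placement read from the figure).
ALIGNMENT = {
    "Human":     "----VKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMIKPFFHSLSEKYSN---VIF",
    "Chicken":   "----VKSVGNLADFEAELKAAGEKLVVVDFSATWCGPCKMIKPFFHSLCDKFGD---VVF",
    "N. crassa": "MSDGVKHINSAQEFANLLNTT--QYVVADFYADWCGPCKAIAPMYAQFAKTFSIPNFLAF",
    "S. aureus": "---MAIVKVTDADFDSKVESG---VQLVDFWATWCGPCKMIAPVLEELAADYEG--KADI",
}


def fraction_identical(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


if __name__ == "__main__":
    for (name1, seq1), (name2, seq2) in combinations(ALIGNMENT.items(), 2):
        print(f"{name1:10s} vs {name2:10s}: {fraction_identical(seq1, seq2):.2f}")
```

On this short region the two vertebrate sequences should come out markedly more similar to each other than either is to the bacterial sequence. Fractional identity is only a crude measure of divergence, because it does not correct for multiple changes at the same site.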
Human actgcttttcaggaagccttggacgctgcaggtgataaacttgtagtagttgacttctca
Chicken gctgattttgaggcagaactgaaagctgctggtgagaagcttgtagtagttgatttctct
S.aureus gcagattttgattcaaaagtagaatctggtgtacaa---------ttagtagatttttgg
*:* **** *:: *: *: * :***: *:::* :: :::::::****:** **:*:
Human gccacgtggtgtgggccttgcaaaatgatcaagcctttctttcat---------------
Chicken gccacatggtgtggaccatgtaaaatgatcaagccatttttccatgtaagtagcctttgt
S.aureus gcaacatggtgtggtccatgtaaaatgatcgctccggtattagaa---------------
**:** ******** ** ** *********:::** :* ** :*:
Human --------------gtgagtattaaacaatgtctgctttgtaagagatttgtgttttttg
Chicken ttttcacagtaacagtaagtat-acacaaatacttctgtgcaacttgtcagtaatatg-g
S.aureus ------------------------------------------------------------
:: ::::: : :::: :: :: :: :: : :: : : :
Human agttggtggtcacagtggtaggaaagaaagacagtt----aaaggattttggtttcggtg
Chicken aggaaacctcctttgtctgtggggtggatggtatttcttgaaggagaatttgtagaagta
S.aureus ------------------------------------------------------------
:: : :: :: : : : : :: :: : :: :: ::
Human gg-----gggatttctttggctccatctttggtctaaaagtagtagtataacaaataatt
Chicken tgtgattggtaactattgaataaagtacttggatacacagcagggaacagacatgctgtt
S.aureus ------------------------------------------------------------
: :: : : :: : :::: : :: :: ::: ::
Human taggtttgatacatgtagcccattgaaa-acaaattttagaagttaattttgtcttaaat
Chicken gcattttgtctgtcctggtgctctgatgcacagtctgtaggtacctagcttccctcaaga
S.aureus ------------------------------------------------------------
:::: : : : ::: ::: : ::: : :: :: ::
Human agttctttttttccccacattgaaaca-----tgggcctta--tttgaaatcccagccta
Chicken a---ctggtaacagtggagttgaaacagtgtgtgtacactggctctgattattaaaactg
S.aureus ------------------------------------------------------------
: :: : :::::::: :: : : : ::: : ::
Human gaatttgatatgccaaactgtttt---atactaa--gaaaaatttgatttagagaaaatt
Chicken cattcagataagctgtagcatctttctgtagtgatggggaggctgtaagggaaggaaagg
S.aureus ------------------------------------------------------------
: : :::: :: : : :: :: : : : : : : :: :::
Human tatgtctcttagatctatgt-ctccaaaaga----tctaaatttttggatctttaattag
Chicken cctttcccttgtgcttaggtgcttcagcagactcttccaggggatggagctgaaaattaa
S.aureus ------------------------------------------------------------
: :: ::: :: :: :: :: ::: :: : : : :::::
Human tctctagttttattaagtttccatttaagaagcttaagcttgggtatgttgcattgccat
Chicken tttctggtca-gtaaagtagctgctttaaggtaacgagtcaag---tctcacagcaccag
S.aureus ------------------------------------------------------------
: ::: :: : :::: : :: : :: : : : :: :::
Human tacctagttctaaatctttt-------tggatttttcattttaaattttccag-------
Chicken tttctatgaatgcatcttttaaagaagtggctttcctggagcagtaactacataattttg
S.aureus ------------------------------------------------------------
: ::: : ::::::: ::: ::: : : ::
Human -------------tccctctctgaaaagtattccaac---gtgatattccttgaagtaga
Chicken tttttcattctagagtctgtgtgacaagtttggtgat---gtggtgttcattgaaattga
S.aureus -------------gaattagcagctgactatgaaggtaaagctgacattttaaaattaga
:* : :*: :*:* * : :::*:: : :*: *::** * **
Figure 4.12 Alignment of partial thioredoxin gene sequences from the genomes of human, chicken, and Staphylococcus aureus.
The region shown contains exons 2 and 3 from the vertebrate genes.
Duplication of genes

Organisms in all three domains of life show duplication of individual genes.
After duplication, both copies of a gene may survive and diverge. Alternatively, one copy may turn into a pseudogene or be deleted, leaving only one functional copy.
As first proposed by S. Ohno in 1970, duplication followed by divergence is an important source of proteins with novel functions. It is generally easier to ‘recruit’ and adapt an already active molecule to a new function than to invent a new protein from scratch. The course of evolution of proteins descended from a common ancestor will differ, depending on whether they are retaining or changing their function (see Box 4.7).

‘Walls supply stones more easily than quarries, and palaces and temples will be demolished to make stables of granite, and cottages of porphyry.’ – Johnson, Rasselas.

tion of Ka and Ks involves more than simple counting because of the need to estimate and correct for possible multiple changes.) The ratio of Ka/Ks distinguishes the role of selective pressure and drift in the divergence of genes after duplication:

Ka/Ks ≈ 1    Neutral evolution: silent and substitution mutations have occurred to approximately equal extents.
Ka/Ks >> 1   Positive selection: substitution mutations are more prevalent than silent mutations, implying that selective pressures are active and the substitutions are advantageous.
Ka/Ks << 1   Purifying selection: substitution mutations are underrepresented, implying that the sequence is optimized fairly rigidly, with relatively little tolerance for mutation.
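The three regimes can be expressed as a small decision rule. This is a sketch only: the text gives no numerical cut-offs for ‘>>’ and ‘<<’, so the thresholds `lower` and `upper` below are hypothetical choices for illustration, and Ka and Ks are assumed to have been estimated already, with correction for multiple changes.

```python
def classify_ka_ks(ka: float, ks: float, lower: float = 0.5, upper: float = 2.0) -> str:
    """Classify the selective regime from Ka (substitution rate) and Ks (silent rate).

    `lower` and `upper` are illustrative cut-offs, not values from the text.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio Ka/Ks")
    ratio = ka / ks
    if ratio < lower:
        return "purifying selection"
    if ratio > upper:
        return "positive selection"
    return "approximately neutral"


if __name__ == "__main__":
    # Hypothetical (Ka, Ks) estimates for three gene pairs.
    for ka, ks in [(0.02, 0.40), (0.30, 0.31), (0.90, 0.20)]:
        print(ka, ks, classify_ka_ks(ka, ks))
```

In practice the boundaries are statistical rather than sharp, and a ratio near 1 is better judged with a significance test than with fixed thresholds.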
[Figure 4.13 labels: ancestral globin; neuroglobin; cellular globin; haemoglobin. Scale ticks: 400, 600, 800, 1000.]
Figure 4.13 Duplication and dispersal through the genome of globin genes during animal evolution.
From: Burmester, T., Ebner, B., Weich, B., & Hankeln, T. (2002). Cytoglobin: a novel globin type ubiquitously expressed in vertebrate tissues.
Mol. Biol. Evol. 19, 416–421.
The many globins in the human genome provide a good example of gene duplication and divergence (see Figure 1.11). Genes for several versions of the haemoglobin α and β chains form clusters on chromosomes 16 and 11. Other, isolated, loci contain genes for neuroglobin, cytoglobin, and myoglobin. Closely linked genes, as in the α- and β-globin regions, suggest relatively recent divergence. Yet even within these clusters, the proteins encoded have diverged in function, showing small but significant variations in oxygen affinity and responses to allosteric effectors. Also, these proteins appear at different stages of our development, implying divergence in the control of their expression.
We can date the globin duplications by looking back into evolutionary history (see Figure 4.13). Neuroglobin split off from other globins before the last common ancestor of the vertebrates, perhaps 10⁹ years ago. The divergence of myoglobin and cytoglobin from haemoglobin occurred before the emergence of the jawless fishes, during the Cambrian about 500 million years ago. The divergence of α- and β-globins occurred early in the vertebrate lineage, approximately 450 million years ago.
Within mammals, the α- and β-globin regions are quite variable in content and extent (see Figure 4.14). Even within the primate lineage, we can date a duplication of the γ-globin gene (see Figure 4.15).

BOX 4.8 Modular proteins

A modular protein contains a linear string of compact units, called domains. Domains appear to have independent stability and can be ‘mixed and matched’ with one another in different proteins. Domains sometimes, but by no means always, correspond to single exons.
Modular proteins are common in eukaryotes. Individual domains of eukaryotic modular proteins are often separately homologous to single-domain prokaryotic proteins (and, less commonly, vice versa).

Duplication can affect individual exons

Fibronectin, a large extracellular protein involved in cell adhesion and migration, is a modular protein (see Box 4.8) containing multiple tandem repeats of
[Figure 4.14 rows, top to bottom: Zebrafish, Xenopus, Chicken, Kangaroo, Rabbit, Mouse, Rat, Goat, Sheep, Cow, Galago, Human.]
Figure 4.14 Layout of the β-globin locus in selected vertebrates. Colour coding: light brown, fish; green, amphibian; purple, avian; magenta, marsupial; dark brown, ε-like; dark blue, γ-like; orange, δ-like; cyan, β-like; white, rat-specific pseudogene; red, η-like.
The zebrafish and Xenopus regions illustrate the organization of the region prior to the separation of α- and β-globins. They alone of the species illustrated here contain only α- and β-globin genes.
Goat and sheep are closely related species that show similar patterns. Rat and mouse are closely related species that show different patterns.
After: Aguileta, G., Bielawski, J.P., & Yang, Z. (2004). Gene conversion and functional divergence in the β-globin gene family. J. Mol. Evol. 59, 177–189.
[Figure 4.15 labels: Other mammals; Prosimians; Early placental mammal; New World monkeys; Old World monkeys, including apes, chimps, and humans. Gene rows as extracted: ε γ ψ δ β; ε γ η δ β; ε γ1 γ2 ψ δ β; ε γ1 γ2 ψ δ β.]
Figure 4.15 Evolution of primate β-globin region (not drawn to scale). Red boxes, embryonically expressed genes; green boxes, post-embryonically expressed genes; blue boxes, foetally expressed genes; black boxes, pseudogenes. Mammals ancestral to the groups in this chart had an embryonic ε-globin and a post-embryonic β-globin gene. Marsupials retain this pair (see Figure 4.14). By the time this diagram begins, the ε gene had duplicated to form three embryonic genes, ε, γ, and η, and the β gene had duplicated to form δ and β. The η gene fell into desuetude, mutating into a pseudogene. Subsequent duplication of the γ gene in anthropoids produced γ1 and γ2, and a change in the control of expression to convert the γ genes from embryonic to foetal expression.

three types of domain called F1, F2, and F3. It is a linear array of the form: (F1)₆(F2)₂(F1)₃(F3)₁₅(F1)₃ (see Figure 4.16). In the human genome, each domain of fibronectin is encoded by either one or two tandem exons.
Fibronectin domains also appear in other modular proteins. The duplication of the exon(s) encoding a domain, followed by transfer to another protein, is called ‘exon shuffling’.

Family expansion: G-protein-coupled receptors

Repeated duplications can generate large numbers of homologues. G-protein-coupled receptors (GPCRs) are a large superfamily of eukaryotic cell-surface receptors active in signal recognition and processing, including the senses of sight, taste, and smell.
The human genome contains about 700 active GPCRs. They are integral membrane proteins with a common structure comprising seven transmembrane helices (see Figure 4.17). GPCRs interact with
Figure 4.17 G-protein-coupled receptors (GPCRs) are a large family of transmembrane proteins involved in signal transduction into
cells. They share a substructure containing seven transmembrane helices, arranged in a common topology. This figure shows the first
experimentally determined mammalian GPCR structure, bovine opsin [1H68]. This molecule senses light and generates a nerve impulse.
The seven-helical structure is common to the family of GPCRs. The helices traverse the membrane, with loops protruding outside and
inside the cell. This figure shows a view parallel to the membrane, with the extracellular side at the top. The transmembrane region is
generally flanked by N- and C-terminal domains. The N-terminal domain is always outside the cell and the C-terminal domain always
inside.
GPCRs constitute the largest known family of receptors. The family is as old as the eukaryotes and is large and diverse. Mammalian
genomes contain ∼1500–2000 GPCRs, accounting for about 3–5% of the genome. A similar fraction of the C. elegans genome codes
for GPCRs.
Some GPCRs are involved in sensory reception, including vision, smell, and taste. Some, like opsin and bacteriorhodopsin, bind
chromophores. (Bacteriorhodopsin is not a signalling molecule but a light-driven proton pump.) Others respond to extracellular ligands
including hormones and neurotransmitters.
As expected from the structure, in many groups of GPCRs the sequences of the helical regions diverge less than the sequences of the
loops. It is the loops that determine the specificity of the ligand, and of the G-protein partner.
The common mechanism of function of GPCRs is a conformational change, induced by ligand binding or light absorption. The
activated state of the GPCR interacts with an intracellular G protein, triggering a signal cascade. As there are substantially more GPCRs
than G proteins, many GPCRs must interact with a single G protein. For instance, all odorant receptors interact with the same G protein
α-subunit.
GPCRs are the targets for many drugs used in the treatment of high blood pressure, asthma, allergies, and other conditions. The large
number of related GPCRs is a challenge to the design of drugs that bind to a unique target. Many drugs have undesired side effects
because of imperfect specificity.
disease, by presenting sites for homologous recombination during meiosis, which show up in some of the gametes as deletions. It is, therefore, likely that Prader–Willi and Angelman syndromes are less common in chimpanzees than in humans.

Whole-genome duplication

Genomes can duplicate if the chromosomes replicate but do not segregate properly into separate progeny cells upon mitosis.
The mere appearance of two copies of many genes does not prove whole-genome duplication. One must adduce (1) the genome-wide occurrence of pairs of homologous genes appearing in the same order; or (2) ‘molecular clock’ evidence showing equal divergence times in many pairs of homologues. (However, different genes diverge at different rates, and homologues may be under different selective pressures. Clock arguments are, therefore, relatively weak.)
The yeast genome underwent a duplication about 10⁸ years ago. The effects are obscured by subsequent chromosomal rearrangements and by massive loss of duplicated material. The duplication has nevertheless left its traces in multiple homologues that retain their genomic order. The yeast genome contains 55 duplicated regions, on average 55 kb long, together covering ∼50% of the genome and including 376 pairs of homologous genes.
In a seminal 1970 book, Evolution by Gene Duplication, S. Ohno proposed that the vertebrate genome is the product of one or more complete genome duplications. Genome sequences confirm his prescient
insight. Nor are whole-genome duplications limited to vertebrates. They have occurred frequently in plant lineages (see p. 219).
In the lineage leading to vertebrates, the genomes of the cephalochordate Amphioxus (Branchiostoma floridae), and the urochordate Ciona intestinalis, showed evidence for two rounds of whole-genome duplication. Individual genes in these relatives corresponded to multiple genes in vertebrates, with enough synteny preserved to show that the process happened in parallel on a large scale. It appears that two whole-genome duplications in the vertebrate line occurred after the split between the primitive chordate relatives, urochordates and cephalochordates, from vertebrates, about 400–600 million years ago. More recently, a third duplication occurred in the lineage leading to ray-finned fishes, such as zebrafish and medaka. (See p. 222.)
What happens to all those extra genes? Most metazoa have roughly the same number of protein-coding genes, in the range 20 000–25 000. Whole-genome duplication is followed by massive gene loss. Some genes do form paralogous groups, and can even duplicate further, individually, creating gene and protein families of various sizes. The globins and GPCRs are examples.
It is interesting to see which kinds of genes do take advantage of the duplication. The comparison of the genome of the primitive chordate Branchiostoma floridae with genomes of vertebrates shows that the set of duplicates that is retained after whole-genome duplication is enriched in genes for signal transduction, transcriptional regulation, neuronal activity, and development. Precisely the features with which early chordates were experimenting.

• Evolution subscribes to the advice, attributed to Yogi Berra: ‘When you come to a fork in the road – take it.’ We may even add: if you don’t come to a fork in the road – take it anyway.

Plant genomes are very susceptible to duplication. The sequence of the Arabidopsis genome reveals at least two and possibly three successive duplication events.
Most plants are polyploids, i.e. they contain multiple sets of entire chromosomes. Autopolyploids contain multiple copies of genomes from the same parent. Allopolyploids contain multiple copies of genomes from different parents. Many crop species are polyploids, relative to the wild species from which they were domesticated, including wheat, alfalfa, oats, coffee, potatoes, sugar cane, cotton, peanuts, and bananas. Often polyploidy increases the size of the fruit or grain, a useful property for agriculture (see Box 4.9).

BOX 4.9 Polyploidy in wheat

The wheat first used in agriculture, in the Middle East at least 10 000–15 000 years ago, is a diploid called einkorn (Triticum monococcum), containing 7 pairs of chromosomes (2n = 14). Emmer wheat (T. dicoccum), also cultivated since palaeolithic times, and durum wheat (T. turgidum), are merged hybrids of relatives of einkorn with other wild grasses to form tetraploid species. Additional hybridizations, to different wild wheats, gave hexaploid forms, including spelt (T. spelta) and modern common wheat (T. aestivum). Triticale, a robust crop developed in modern agriculture and currently used primarily for animal feed, is an artificial genus arising from crossing durum wheat (T. turgidum) and rye (Secale cereale). Most triticale varieties are hexaploids.

Variety of wheat   Classification         Chromosome complement
Einkorn            Triticum monococcum    AA
Emmer wheat        Triticum dicoccum      AABB
Durum wheat        Triticum turgidum      AABB
Spelt              Triticum spelta        AABBDD
Common wheat       Triticum aestivum      AABBDD
Triticale          Triticosecale          AABBRR

A, genome of original diploid wheat or a relative; B, genome of a wild grass, Aegilops speltoides or a relative; D, genome of another wild grass, T. tauschii or a relative; R, genome of rye S. cereale.

All of these species are still cultivated – some to only minor extents – and have their individual uses in cooking. Spelt, or farro in Italian, is the basis of a well-known soup; pasta is made from durum wheat; and bread is made from T. aestivum.
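The chromosome-complement notation of Box 4.9 (AA, AABB, AABBDD, ...) can be read mechanically: each letter is one set of chromosomes, and the number of distinct letters is the number of parental genomes. The following is a minimal sketch of that reading. The function name and the returned labels are illustrative, and the rule that a polyploid with a single genome type is an autopolyploid simply restates the definitions in the text.

```python
PLOIDY_NAMES = {2: "diploid", 4: "tetraploid", 6: "hexaploid", 8: "octaploid"}


def describe_complement(complement: str) -> tuple[str, list[str], str]:
    """Interpret a complement string such as 'AABBDD'.

    Returns (ploidy name, distinct genome labels, kind), where kind is
    'not polyploid', 'autopolyploid', or 'allopolyploid'.
    """
    if not complement or not complement.isalpha():
        raise ValueError("complement must be a non-empty string of genome letters")
    sets = len(complement)                 # one letter per chromosome set
    genomes = sorted(set(complement))      # distinct parental genomes
    name = PLOIDY_NAMES.get(sets, f"{sets}-ploid")
    if sets <= 2:
        kind = "not polyploid"
    elif len(genomes) == 1:
        kind = "autopolyploid"
    else:
        kind = "allopolyploid"
    return name, genomes, kind


if __name__ == "__main__":
    wheats = {"Einkorn": "AA", "Durum wheat": "AABB",
              "Common wheat": "AABBDD", "Triticale": "AABBRR"}
    for variety, complement in wheats.items():
        print(variety, describe_complement(complement))
```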
[Figure 4.18 panel labels: chromosomes I–V and XIII–XVIII; IIp, IIq; = heterochromatin; ANC. Sources cited within the figure: Weinberg J. et al., Chrom. Res. 2 (1994): 405–410; Muller S. et al., PNAS 97 (2000): 206–211; Marzella R. et al., Genomics 63 (2000): 307–310; Marzella R. et al., Cytog. Cell Genet. 77 (1977): 232–237; Stanyon R. et al., Am. J. Ph. Anthr. 88 (1992): 245–250; Archidiacono N., personal data.]
Figure 4.18 Top: photograph of banding patterns. Bottom: ideograms. HSA, Homo sapiens; PTR, Pan troglodytes (chimpanzee); GGO,
Gorilla gorilla; PPY, Pongo pygmaeus (orang-utan).
Photographs courtesy of Prof. M. Rocchi, Università di Bari, Italy.
Polyploidy may have other advantages. In studies of Arctic flora, it is observed that the fraction of
polyploid plant species increases towards higher latitudes. Many arctic plants tend to exist in small,
separated populations and frequently go through ‘bottlenecks’ of marginal survival, for instance during
glaciations. After recession of the ice, deglaciated areas may be repopulated by a few or even one
dispersed seed. Carrying many copies of the genome in the cells of each individual may help to preserve
genetic diversity, even in tiny populations.
Although polyploidization is much more common in plants than in animals, related species of frogs
(genus Xenopus) are diploid, tetraploid, octaploid, and dodecaploid. One tetraploid mammal is known,
the rat Tympanoctomys barrerae from the Monte Desert in west-central Argentina. These species provide
a model for control of expression of duplicated genes. For example, in ‘polyploid’ frogs, silencing is
non-syntenic. Each copy of the genome contains some expressed and some silenced genes. This is a
different model from the silencing of an entire X chromosome in cells of mammalian females (see
Chapter 1).
There are a number of examples of tissue-specific polyploidization. The endosperm of maize kernels
undergoes repeated cycles of endoreplication (replication of nuclear DNA in the absence of mitosis) to
produce cells that can have as many as 96 copies of the haploid genome. In mammals, it is observed that
the number of polyploid cells in the liver increases with age, or in response to disease or surgery
even in children. This may be a defence against oxidative stress. In the bone marrow of mammals, very
large polyploid cells called megakaryocytes ‘bud off’ portions of their cytoplasm to form platelets.
(Platelets are enucleate cells in the blood involved in clotting.)
In a related condition, called polyteny, replicated chromosomes remain in alignment rather than
separate as in polyploids. This is the origin of the giant salivary gland chromosomes of Drosophila,
which played such an important role in the history of cytogenetics (see Figure 3.3).
Comparisons at the chromosome level: synteny
Comparison of chromosome banding patterns provides snapshots of similarities and differences in
large-scale organization among eukaryotic genomes. Synteny literally means ‘on the same band’, that is,
on the same chromosome. (The chromosome exchange that causes chronic myeloid leukaemia (see Chapter 3)
is a breaking of synteny.) Closely related species generally show a correspondence between large
syntenic blocks. The similarity of the banding patterns reveals the underlying similarity of the
patterns in the DNA sequences themselves.
Figure 4.18 shows the relationships among the karyotypes of human, chimpanzee, gorilla, and orang-utan.
What makes us human?
It is too difficult to look only at the human genome and try to deduce . . . ourselves. Two approaches
help in understanding the genome.
• Comparative genomics. We can compare the human and chimpanzee genomes and ask how differences
between these genomes might give rise to differences between the species.
• Study of human disease. Many mutations cause disease and give clues to the functions of the affected
regions. These regions may encode enzymes or regulatory proteins or RNAs, or they may be DNA sequences
that are targets of regulatory mechanisms. Understanding the effects of mutations both illuminates
human biology and, often, has immediate clinical applications.
Comparative genomics
The human and chimpanzee genomes are about 96% identical. To understand what makes us human – or at
least what makes us not chimpanzees – we can focus on about 130 Mb of different sequence (4% of the
total), rather than the full 3.2 billion bases. There are even fewer differences in our amino acid
sequences. Humans and chimpanzees express very similar sets of proteins, and most of the
homologous proteins of chimpanzee and human are identical or very similar. About 30% of homologous
human and chimpanzee proteins show no differences at all. On average, there are only two amino acid
differences.
How, then, do humans and chimpanzees develop differently? The ultimate answer must lie within the
static sequence of the genome. However, a satisfactory answer will require understanding of the
dynamics, specifically of patterns of regulation of gene expression.
There is a paradox here. On the one hand, living systems are fairly robust to perturbations. Yeast, for
example, survives individual knockout of 80% of its genes. On the other hand, the 4% differences
between chimpanzee and human genomes make profound differences in phenotype. This suggests a chaotic
system, one in which tiny perturbations can lead to large changes in the subsequent trajectory.
Superposed on the robustness are specific changes that exert immense leverage.
There are two ways to find these crucial sequences. One is to look closely at the differences between
human and chimpanzee genomes and try to figure out what the changed loci are doing. Another is to
examine human mutations that affect phenotypic properties that chimpanzees do not share with humans,
such as language and reasoning.
Combining the approaches: the FOXP2 gene
Language is a unique feature of our species. It should show up as a genetic difference between our
genomes and those of other species, including chimpanzee.
Many people suffer from diseases that interfere with production or comprehension of language, or both.
Some of these are associated with trauma, or with complex genetics. However, one abnormality with
simple Mendelian inheritance appears in a family in London. Members of the ‘KE’ family have a severe
disorder affecting both the facial motor control involved in producing speech and also the mental
processing of language.
The mutation responsible for this condition has been identified. It is a single-nucleotide
polymorphism (SNP) in a gene called FOXP2, which encodes a transcription factor. The major protein
encoded by FOXP2 is 715 amino acids long. It is quite a stable protein from the evolutionary point of
view, with only one substitution between mouse and the identical chimpanzee, rhesus macaque, and
gorilla sequences. However, the human protein has two mutations relative to the other primate
sequences.
This example illustrates the power of a combination of studies of human phenotypes and comparative
sequence analysis. And yet the observation that the phenotype shown by members of the KE family arises
from a single SNP in a single gene may be deceptively simple. We cannot conclude that the expression of
only one gene is involved in creating the phenotype, as is the case for phenylketonuria, for example.
The FOXP2 gene product is a transcription factor. Its activity affects the expression of many genes.
The effectiveness of their coordinated expression required co-evolution, i.e. sequence changes in
other genes.
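The scale of the human–chimpanzee comparison discussed in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are not from the text, and the inputs are the rounded figures quoted above (a 3.2 billion base genome, 96% identity, a 715-residue FOXP2 protein with two human-specific substitutions).

```python
# Back-of-envelope figures; all names here are illustrative assumptions,
# and the numbers are the rounded values quoted in the text.
GENOME_SIZE_BP = 3.2e9          # approximate size of the human genome
FRACTION_DIFFERENT = 0.04       # genomes are about 96% identical
FOXP2_LENGTH = 715              # residues in the major FOXP2 protein
FOXP2_HUMAN_CHIMP_DIFFS = 2     # human substitutions relative to chimpanzee


def differing_megabases(genome_size_bp, fraction_different):
    """Amount of sequence (in Mb) that differs between two genomes."""
    return genome_size_bp * fraction_different / 1e6


def percent_identity(length, differences):
    """Percentage of aligned positions that are identical."""
    return 100.0 * (length - differences) / length


if __name__ == "__main__":
    mb = differing_megabases(GENOME_SIZE_BP, FRACTION_DIFFERENT)
    ident = percent_identity(FOXP2_LENGTH, FOXP2_HUMAN_CHIMP_DIFFS)
    print(f"Differing sequence: about {mb:.0f} Mb")
    print(f"FOXP2 human-chimpanzee identity: {ident:.2f}%")
```

On these inputs the differing sequence comes to roughly 128 Mb, and the FOXP2 proteins are more than 99.7% identical, which underlines how small a sequence change can accompany a large phenotypic difference.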
The genome sequence of Clint, a male chimpanzee from the Yerkes National Primate Research Center at
Emory University, in Atlanta, Georgia, USA, was reported in 2005. Clint represented the West African
subspecies Pan troglodytes verus.
As expected from so closely related a species:
• There is close alignment of the genome: 96% is alignable with the human genome.
• The sequences of the alignable regions differ at 1.23% of the positions. Recognizing that there is
intraspecies divergence among humans and among chimpanzees, it is likely that the true interspecies
difference amounts to about 1% of the alignable sequence.
• The 4% non-alignable regions represent insertions and deletions. It is estimated that about 45 Mb of
human sequence do not correspond to chimpanzee sequence and a similar amount of chimpanzee sequence
does not correspond to human sequence. More positions in the genomes differ as a result of
insertions/deletions than differ as a result of base substitutions.
• The distribution of differences is variable across the genome. For all syntenic 1 Mb segments across
the genomes, the range in difference is about 0.5–2.5% (0.005–0.025 as a fraction of positions).
Looking at the distribution with respect to the chromosomes, divergence tends to be higher near the
telomeres. The divergence is lowest for the X chromosome and highest for the Y.
• The proteins encoded are also very similar in sequence. Of 13 454 orthologous proteins, 29% have
identical sequences. On average, there are one to two amino acid residue differences between
corresponding chimpanzee and human proteins.
• Although most proteins are very similar, a few show large Ka/Ks ratios, suggesting that they are
under positive selection. These include two proteins, glycophorin C and granulysin (involved in
combating infection) and other proteins involved in reproduction. (Selection can act most directly on
reproduction itself, producing high rates of evolutionary change.)
• Changes in gene expression patterns show that genes active in the brain have changed more rapidly in
humans.
Sadly and unexpectedly, Clint died a few weeks before the paper on his genome was submitted to Nature.
He was 24 years old. Even in the wild, chimpanzees can live for 40–45 years. Cheeta, a chimp that
appeared in Tarzan movies in the 1930s, is almost 80 years old.
Genomes of mice and rats
The mouse and rat are by far the most common mammalian laboratory animals. Knowledge of these species
and correlation with human biology is encyclopaedic. Determination of mouse and rat complete genome
sequences was clearly a high-priority goal. The genome of the laboratory mouse (Mus musculus) appeared
in December 2002. The genome of the brown or Norway rat (Rattus norvegicus) appeared in April 2004.
Mice, rats, and humans are closely related mammals and illuminate one another. Laboratory studies on
rodents are useful guides to the biochemistry and molecular biology of humans. Mice and rats provide
the first test of tolerance to, and effectiveness of, novel drugs aimed ultimately at human therapy.
Outside the laboratory, however, the close relationship has been tragic for humans: shared parasites
permit rats to transmit disease. (An epidemic of bubonic plague in 1347–1352 killed a third of the
population of Europe.) Shared diets make food supplies vulnerable to rodent infestation.
The last common ancestor of humans and rodents lived approximately 75 million years ago. Rats and mice
separated much more recently: 12–24 million years ago. The genomes of all three species are
approximately the same size. The rat genome is ∼5% smaller than the human genome. The mouse genome is
∼15% smaller than the human genome. Sequence divergence and chromosome segment rearrangement appear to
have been faster in the rodent lineage. The human genome shows more duplication – one reason why it is
larger.
Human, mouse, and rat genomes encode similar numbers of genes. Most proteins have homologues in all
three species, with very similar amino acid sequences (see Figure 4.19). (Transgenic animals –
substituting rodent genes with human genes – can equip model organisms with exact human sequences if
necessary.) The genes for most rodent–human homologues have a common exon–intron structure. Gene
duplications create protein families, which may be of different sizes in different species. For
instance, consistent with their greater dependence on a sense of smell, rodents have more odorant
receptors than we do.
Some genomic variation is observable at the chromosomal level. The mouse has 19 chromosomes, plus X/Y.
The rat has 20 chromosomes, plus X/Y. Synteny between the mouse and rat genomes is high.
Cytochrome c
                 10        20        30        40        50        60
                  |         |         |         |         |         |
Human    GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
Mouse    GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWG
Rat      GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWG
         GDVEKGKKIF  KC QCHTVEKGGKHKTGPNLHGLFGRKTGQA G SYT ANKNKGI WG
                 70        80        90       100
                  |         |         |         |
Human    EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Mouse    EDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
Rat      EDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
         EDTLMEYLENPKKYIPGTKMIF GIKKK ERADLIAYLKKATNE
Figure 4.19 Alignment of the amino acid sequences of cytochrome c from human, mouse, and rat. All have 104 residues. The mouse
and rat sequences are identical. Letters below the sequences show residues conserved in all three species. The human sequence shows
nine substitutions, most of which are conservative. For colour coding, see Figure 4.11.
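The counts given in the caption of Figure 4.19 can be reproduced directly from the aligned sequences. The following is a minimal sketch, not part of the original text: the function name `substitutions` is an illustrative choice, and the sequences are transcribed from the figure (the rat sequence is taken to be identical to the mouse sequence, as the caption states).

```python
# Sequences transcribed from Figure 4.19; names are illustrative.
HUMAN = (
    "GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG"
    "EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"
)
MOUSE = (
    "GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWG"
    "EDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE"
)
RAT = MOUSE  # the caption states that mouse and rat are identical


def substitutions(seq_a, seq_b):
    """Return (position, residue_a, residue_b) for each mismatch.

    Positions are 1-based. The sequences must already be aligned and
    of equal length (no gaps are needed for these three proteins).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


if __name__ == "__main__":
    diffs = substitutions(HUMAN, MOUSE)
    print(f"length: {len(HUMAN)} residues")
    print(f"human vs mouse: {len(diffs)} substitutions")
    for pos, a, b in diffs:
        print(f"  position {pos}: {a} -> {b}")
```

The blanks in the consensus line under the alignment fall at exactly the positions this function reports.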
Synteny between the mouse and human genomes is variable. Most human chromosomes contain many small
blocks that correspond separately to regions distributed among several mouse chromosomes. However,
almost all of human chromosome 20 corresponds almost continuously with a region of mouse chromosome 2.
Almost all of human chromosome X appears in mouse chromosome X, differing by rearrangement of nine
contiguous blocks, including some reversals. Almost all of human chromosome 17 appears within a region
of mouse chromosome 11; however, the sequence is broken into 16 segments that are rearranged, including
some reversals.
The blocks are identified by matching sequences of genetic markers. Even though most of the
correspondences are distributed among different chromosomes, most of the genomes can be partitioned
into syntenic blocks, making up a total of 2.35 Gb (over 90% of the mouse genome). These regions
include almost all known exons and regulatory regions.
Alignment of genetic maps is less stringent than alignment of sequences. The rearrangements within and
among chromosomes make it impossible to align the three genomes as one long linear sequence. However,
at the nucleotide level, 40% of human and mouse genomes are block-alignable, about 1 Gb in all.
The differences among the human, mouse, and rat genomes arise from selection or neutral drift. Regions
changing under selection rather than drift are likely to be functional. This is a powerful way to
search for regulatory regions, which are harder to identify in genomes than protein-encoding genes.
Genomics confirms the utility of the rat and mouse for clinical research. Of a set of ∼1000 genes for
which known mutations are associated with human disease, almost all have homologues in rodents. In
certain interesting cases, the sequence of the human disease-associated mutant is identical to the
mouse and rat wild type. There has probably been co-evolution, with a compensatory change in some
other gene or genes. This can be a source of clues to the function and interactions involved in the
disease.
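The idea of partitioning a chromosome into syntenic blocks by matching the order of genetic markers can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the procedure used in the genome papers: each marker along a human chromosome is assumed to have exactly one mouse counterpart, given as a chromosome name and a rank in mouse marker order, and a block is taken to be a run of markers that stay on one mouse chromosome with consecutive ranks in a single orientation. The marker names and positions are hypothetical.

```python
def syntenic_blocks(markers):
    """Group markers, listed in human order, into syntenic blocks.

    markers: list of (name, mouse_chromosome, mouse_rank) tuples.
    A block continues while the mouse chromosome is unchanged and the
    mouse rank changes by exactly +1 (same orientation) or exactly -1
    (reversed), consistently within the block.
    """
    if not markers:
        return []
    blocks = []
    current = [markers[0]]
    direction = 0  # 0 = not yet known, +1 = forward, -1 = reversed
    for prev, cur in zip(markers, markers[1:]):
        step = cur[2] - prev[2]
        adjacent = cur[1] == prev[1] and abs(step) == 1
        consistent = direction == 0 or step == direction
        if adjacent and consistent:
            current.append(cur)
            direction = step
        else:
            blocks.append(current)
            current = [cur]
            direction = 0
    blocks.append(current)
    return [[name for name, _, _ in block] for block in blocks]


# Hypothetical markers along one human chromosome.
EXAMPLE = [
    ("m1", "11", 5), ("m2", "11", 6), ("m3", "11", 7),  # forward run
    ("m4", "11", 3), ("m5", "11", 2),                   # reversed run
    ("m6", "2", 10),                                    # other chromosome
]

if __name__ == "__main__":
    for block in syntenic_blocks(EXAMPLE):
        print(block)
```

Real marker maps are noisier than this, so a practical method would tolerate small gaps and isolated mismatches rather than requiring strictly consecutive ranks.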
Model organisms for study of human diseases
Above the molecular level, the differences between humans and flies and between humans and worms are
more obvious than the similarities. At the biochemical and genomic level, the situation is reversed.
The underlying common features of the structure, organization, and development of different species
are of both academic interest and practical importance, as flies and worms provide models for human
diseases.
From: The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating
biology. Science 282, 2012–2018.
The genome of Caenorhabditis elegans
The nematode worm Caenorhabditis elegans entered molecular biology in 1963, at the invitation of Sydney
Brenner. Brenner recognized its potential as a sufficiently complex organism to be interesting but
simple enough to permit complete analysis of its development and neural circuitry, at least at the
cellular level. The C. elegans genome, completed in 1998, was the first full DNA sequence of a
multicellular organism.
C. elegans contains ∼97 Mb of DNA distributed on six paired chromosomes (see Box 4.10). There is an X
but no Y chromosome: different genders in C. elegans are a self-fertilizing hermaphrodite, genotype XX,
and a male, genotype XO (i.e. a single unpaired X chromosome).
The C. elegans genome is about eight times larger than that of yeast, and its 19 099 predicted genes
are approximately three times the number in yeast. Exons cover ∼27% of the genome. The genes contain an
average of five introns. The gene density is relatively high, for a eukaryote, with ∼1 gene/5 kb of
DNA. Approximately 25% of the genes are in clusters of related genes.
The genome of Drosophila melanogaster
The total chromosomal DNA of Drosophila melanogaster contains about 180 Mb. Approximately two-thirds is
euchromatin, a relatively uncoiled and non-compact form, containing most of the active genes. The
euchromatic portion, about 120 Mb, was the first segment of the sequence released. The other one-third
of the Drosophila genome appears as heterochromatin, highly compact regions flanking the centromeres.
Heterochromatin contains many tandem repeats of the sequence AATAACATAG and relatively few genes.
The genome is distributed over five chromosomes: two large autosomes, a tiny chromosome containing only
∼1 Mb of euchromatin and an X/Y chromosome pair, of which the Y chromosome is heterochromatic and
relatively gene poor. The fly’s ∼14 000 genes are approximately double the number in yeast, but fewer
than in C. elegans, perhaps a surprise. The average density of genes in the euchromatin sequence is 1
gene/9 kb; about half that of C. elegans (see Box 4.11). The genes of the metacentric chromosomes 2 and
3 are reported separately for the two arms, arbitrarily designated left (L) and right (R). The other
chromosomes are telocentric (see Figure 4.20).
Determination of the D. melanogaster genome sequence was a collaboration between industry (Celera
Genomics) and the academic Drosophila Genome Projects based in Berkeley, California, USA, and in
Europe. The project was a methodological testbed.
• First, it showed that a relatively large eukaryotic genome could be completed by the method of
whole-genome shotgun sequencing (see Chapter 3).
Cytochrome c
                                     10        20        30        40        50        60
                                      |         |         |         |         |         |
Human                             GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNK
D. melanogaster (isoform 1)     GSGDAENGKKIFVQKCAQCHTYEVGGKHKVGPNLGGVVGRKCGTAAGYKYTDANIKK
D. melanogaster (isoform 2)   GVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLIGRKTGQAAGFAYTDANKAK
C. elegans                   SDIPAGDYEKGKKVYKQRCLQCHVVDS_TATKTGPTLHGVIGRTSGTVSGFDYSAANKNK
                                  GD E GKK     C QCH        K GP L G  GR  G   G  Y  AN  K
                                     70        80        90       100       110
                                      |         |         |         |         |
Human                        GIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE__
D. melanogaster (isoform 1)  GVTWTEGNLDEYLKDPKKYIPGTKMVFAGLKKAEERADLIAFLKSNK____
D. melanogaster (isoform 2)  GITWNEDTLFEYLENPKKYIPGTKMIFAGLKKPNERGDLIAYLKSATK___
C. elegans                   GVVWTKETLFEYLLNPKKYIPGTKMVFAGLKKADERADLIKYIEVESAKSL
                             G  W    L EYL  PKKYIPGTKM F G KK  ER DLI
Figure 4.21 Alignments of the amino acid sequences of cytochrome c from human, D. melanogaster (two isoforms) and C. elegans.
For colour coding, see Figure 4.12.
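The line of conserved residues printed beneath a multiple alignment such as Figure 4.21 can be derived mechanically: a column contributes a letter only if every sequence has the same residue there. The sketch below is illustrative, not the author's method. It uses the first 60 alignment columns transcribed from the figure, and as an assumption of this example it writes every gap, including the blank leading positions, as `-`.

```python
# First 60 alignment columns of Figure 4.21. Writing all gaps as "-"
# (including leading blanks) is a convention chosen for this example.
ALIGNMENT = {
    "Human":
        "-----GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNK",
    "D. melanogaster (isoform 1)":
        "---GSGDAENGKKIFVQKCAQCHTYEVGGKHKVGPNLGGVVGRKCGTAAGYKYTDANIKK",
    "D. melanogaster (isoform 2)":
        "-GVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLIGRKTGQAAGFAYTDANKAK",
    "C. elegans":
        "SDIPAGDYEKGKKVYKQRCLQCHVVDS-TATKTGPTLHGVIGRTSGTVSGFDYSAANKNK",
}


def consensus(rows, gap="-"):
    """Return a string with a residue at each fully conserved column.

    Columns that vary, or that contain a gap in any sequence, are
    shown as a space.
    """
    out = []
    for column in zip(*rows):
        first = column[0]
        conserved = first != gap and all(c == first for c in column)
        out.append(first if conserved else " ")
    return "".join(out)


if __name__ == "__main__":
    line = consensus(list(ALIGNMENT.values()))
    print(line)
    print(len(line.replace(" ", "")), "conserved columns out of", len(line))
```

Only strict identity is counted here; conservative substitutions, which the colour coding of the figure is designed to highlight, would need a residue-similarity table.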
genetically) in the laboratory, has a short generation time, is safe to humans, and comes with an
extensive knowledge of its biology. As each model organism has its own strengths and limitations,
different organisms are useful for different investigations.
In principle, biologists will use the simplest organism that illustrates the human feature of interest.
It is easier to do experiments with yeast than with fruit flies. However, sometimes there is no choice.
The only animal other than humans that is susceptible to leprosy is the armadillo, which satisfies few,
if any, of the criteria of an ideal laboratory organism.
There are two ways in which model organisms can contribute to understanding and treatment of human
disease. The first is to observe homologues in the model organisms of genes implicated in human
diseases. One can then study the effect on the model organism of mutation or knockout of the
homologues. The second is to introduce a human gene into a model organism and discover its phenotypic
effect. A model animal containing an active human gene makes it possible to screen libraries of
compounds for potential drugs.
Table 4.8 shows some of the human disease-associated genes with homologues in D. melanogaster, C.
elegans, and Saccharomyces cerevisiae. The database Homophila provides links between human
disease-associated genes and Drosophila homologues.
Despite the fact that insects are not very closely related to mammals, fruit flies are useful in the
study of human disease. The D. melanogaster genome contains homologues of human genes implicated in
cancer and in cardiovascular, neurological, endocrinological, renal, metabolic, and haematological
diseases. Some of these homologues have different functions in humans and flies. Other human
disease-associated genes can be introduced into, and studied in, the fly. For instance, the gene for
human spinocerebellar ataxia type 3, when expressed in the fly, produces similar neuronal cell
degeneration. There are now fly models for Parkinson’s disease and malaria.
C. elegans also provides human disease models. Mutations in the human gene for presenilin-1 (PS1) are
associated with familial early-onset Alzheimer’s disease. Mutations in the homologous gene in C.
elegans, sel-12 (Figure 4.22), do show neurological defects, but in only a few neurons. Mutants do show
more profound defects in egg laying, but this may be a secondary effect.
Although there are greater differences between the nervous systems of humans and C. elegans than
between their machineries for respiratory energy transduction, the difference between the homologues
shown here is not greater than the difference between the cytochrome c proteins. The relationship
between sequence and function in proteins is full of surprises.
• We have discussed selected genomes – chimpanzee, mouse, rat, worm, and fly – the first because it is
the closest extant relative we have, and the others because of their importance as laboratory animals.
Two aspects of comparative genomics are to study differences between genomes, to try to account for
phenotypic divergence; and to study and apply similarities. One important application is to use
laboratory animals as models for human diseases.
Table 4.8 Human disease-associated genes shared with worms, flies, and yeast
Bones Multiple exostoses Ossification at tips of femur, pelvis, or ribs EXT1 *** ** –
Blood Leukaemia Chronic myelogenous leukaemia, ABL1 *** *** *
a blood cell cancer
Bruton agammaglobulinaemia Lack of mature B cells BTK *** ** *
Glucose-6-phosphate Drug- and stress-induced rupture of G6PD **** **** ****
dehydrogenase deficiency red blood cells
Brain Early-onset Alzheimer’s disease Common cause of mental retardation PS1 ** ** –
Fragile X syndrome FMR1 ** – –
Juvenile Parkinson’s disease PARK2 *** ** *
Colon Hereditary non-polyposis cancer Polyps that become malignant MSH2 *** *** ***
Adenomatous polyposis APC *** * –
Ears Hereditary deafness MYO15 *** *** ***
Eyes Retinoblastoma Cancer of the eye RB1 * * –
Heart Familial cardiac myopathy Inherited cardiac disease MYH7 *** *** ***
Long QT syndrome Sometimes fatal cardiac arrhythmias 3-SCN5A *** ** *
Kidney Polycystic kidney disease 2 PKD2 ** ** –
Liver Wilson’s disease Build-up of copper in cells, causing liver disease ATP7B *** *** ***
and other symptoms
Lung Cystic fibrosis Progressive disease of lungs and pancreas CFTR *** *** –
Lung cancer Caused by defects in p53 gene, which can also p53 * – –
cause cancer of the oesophagus, colon,
brain, lung, breast, and skin
Muscles Duchenne’s muscular dystrophy Progressive atrophy of muscles DMD *** *** –
Pancreas Pancreatic cancer MADH4 *** * –
Pancreatic cancer RAS ** ** **
Prostate Advanced cancer of the prostate Caused by mutations in the PTEN gene, PTEN ** ** *
which can also cause cancer of the brain,
endometrium, and breast
Skin Xeroderma pigmentosum D Early-onset skin cancer XPD *** ** ***
Neurofibromatosis 1 Soft tumours at many sites, plus skeletal NF1 *** * **
and neurological defects
Thyroid Cancer of the thyroid Multiple endocrine neoplasia type 2 MEN2 *** ** *
Based on data from Rubin, G.M., et al. (2000). Comparative genomics of the eukaryotes. Science 287, 2204–2215. Presentation adapted
from http://www.hhmi.org/genesweshare/e400.html.
The ENCODE project 151
10 20 30 40 50 60 70
| | | | | | |
Human MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGHPEPLSNGRPQGNSRQVVEQDEEED
C. elegans MPSTRRQQEGGGADAETHTVYGTNLITNRNS _ _ _ _ QEDENVV
R QE H N NS DE
Figure 4.22 Alignment of the amino acid sequences of the human protein presenilin-1 and the C. elegans homologue SEL-12.
The ENCODE project (Encyclopedia of DNA elements) is a systematic development and application of
comparative genomics. It has the ultimate goal of developing methods for comprehensive identification
of functional regions of the human genome, including coding and regulatory regions. A selected portion
of the human genome – 1%, about 30 Mb – will be the initial focus. The basic approach will be
comparative genomics and will involve both laboratory and computational analysis.
High-quality sequences will be finished to state-of-the-art standards, including resolving difficult
regions. Medium-quality sequences will have >8-fold coverage, with manual refinement of assembly.
Unfinished sequences are whole-genome shotguns; the coverage may vary and assembly may be incomplete.
Regions corresponding to the selected human genome segments from 29 vertebrates will be sequenced (see
Table 4.9). These data will illuminate each other. The ENCODE project will apply, improve, and develop,
as necessary, a variety of experimental and computational methods. Lessons learned from work with the
selected subset will guide the scaling up of successful methods to analysis of entire genomes.
Coordinating with ENCODE, the HapMap Project (see Chapter 1) focuses on variations among humans in ten
of the ENCODE regions. Sequences from 48 individuals from different geographic origins have yielded
30 000 SNPs.
Analysis of function involves two steps: deciding whether a segment has functional significance and, if
Quality of sequencing
Class:
Actinopterygii Zebrafish
Amphibia Frog
Aves Chicken
Class: Mammalia
Order: Suborder:
Monotremata Platypus
Marsupialia Opossum
Proboscidea African elephant
Insectivora Tenrec
Xenarthra Armadillo
Insectivora Hedgehog
Insectivora Shrew
Chiroptera Bat
Artiodactyla Cow
Carnivora Dog
Carnivora Cat
Rodentia Mouse
Rodentia Rat
Rodentia Guinea pig
Lagomorpha Rabbit
Primates Prosimii Galago
Primates Prosimii Mouse lemur
Primates Platyrrhini Dusky titi
Primates Platyrrhini Owl monkey
Primates Platyrrhini Marmoset
Primates Catarrhini Colobus
Primates Catarrhini Macaque
Primates Catarrhini Baboon
Primates Hominidae Orang-utan
Primates Hominidae Chimpanzee
Primates Hominidae Human
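[Figure 4.23 image not reproduced. Panel (a): aligned sequences for human, chimp, baboon, rhesus
monkey, green monkey, dusky titi, colobus, and spider monkey. Panel (b): percent variation (0–100%)
plotted along the sequence (0–1200 bp), with LXR-alpha exon 3 marked.]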
Figure 4.23 Patterns of variation in multiple sequence alignments can suggest regions of likely function. This diagram shows analysis
of a 1200 bp region in primate genomes containing an exon of the liver X receptor α gene and flanking regions. This gene encodes a
nuclear receptor responsive to elevated levels of intracellular cholesterol. (a) Human, reference sequence; purple, regions in which the
sequence of the indicated region differs from at least one other species. The columns with no purple are conserved in all species and
define regions likely to be functional. (b) Plot of % variation along the sequence. Regions of lowest variability correspond to the known
exon. (A similar but not identical approach was used by E.A. Kabat and T.T. Wu in their classic work identifying complementarity-
determining regions of antibodies from regions of hypervariability.)
From: Nobrega, M.A. & Pennacchio, L.A. (2004). Comparative genomic analysis as a tool for biological discovery. J. Physiol. 554, 31–39.
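The kind of plot shown in panel (b) of Figure 4.23 can be approximated by scoring each alignment column as variable or conserved and averaging over windows. The sketch below is a simplified illustration, not the method of the cited paper: the function name, the window size, and the three short toy sequences are all hypothetical, and a column counts as variable if any two sequences differ there.

```python
def percent_variation(rows, window):
    """Percentage of variable columns in consecutive windows.

    rows: equal-length aligned sequences.
    window: number of columns per window; a trailing partial window
    is ignored.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all sequences must have the same length")
    variable = [len(set(column)) > 1 for column in zip(*rows)]
    return [
        100.0 * sum(variable[start:start + window]) / window
        for start in range(0, len(variable) - window + 1, window)
    ]


# Hypothetical toy alignment: the first eight columns are conserved,
# most of the last four vary.
TOY = [
    "ACGTACGTAAAA",
    "ACGTACGTACAT",
    "ACGTACGTTAAC",
]

if __name__ == "__main__":
    for i, value in enumerate(percent_variation(TOY, window=4)):
        print(f"window {i}: {value:.0f}% variable")
```

Windows with the lowest values are the candidates for functional regions, in the same spirit as the exon standing out as the region of lowest variability in the figure.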
so, identifying what it does (see Figure 4.23). Approximately 5% of the human genome is conserved with
respect to mouse and rat sequences. This 5% should have interesting functions (without implying that
the other 95% does not). Only about one-third of this 5% is predicted to encode protein. Analysis of
function will require treatment of both protein-coding and non-protein-coding regions.
Accordingly, the criteria for selection of regions for the ENCODE project included choosing regions
with ranges of gene density and of non-exonic conservation with respect to the mouse sequence. The
result is a set of 44 discrete regions, spread around different human chromosomes and the syntenic
regions in other species. These include well-studied regions such as the α- and β-globin loci and the
region containing CFTR, the gene for the cystic fibrosis transmembrane conductance regulator, for which
sequence information from different species is known.
Sequences of the ENCODE target regions can be aligned and compared (see Figure 4.23).
In 2007, ENCODE moved into its second phase, and a companion project, modENCODE, began (Table 4.10).
The modENCODE project
modENCODE extends the ENCODE project to model organisms. Its initial goal is to identify functional
elements in the C. elegans and D. melanogaster genomes.
Table 4.10 Approximate sizes of ENCODE regions
Chromosome   Approximate sizes of ENCODE regions (Mb) (gene of interest)
1            0.5
2            0.5, 0.5, 0.5, 0.5
4            0.5
5            0.5, 0.5, 1.0 (interleukin)
6            0.5, 0.5, 0.5, 0.5
7            0.5, 1.0, 1.1, 1.2, 1.9 (CFTR)
8            0.5
9            0.5
10           0.5
11           0.5, 0.5, 0.6, 0.5 (Apo cluster), 1.0 (β-globin)
12           0.5
13           0.5, 0.5
14           0.5, 0.5
15           0.5
16           0.5, 0.5, 0.5 (α-globin)
18           0.5, 0.5
19           1.0
20           0.5
21           0.5, 1.7
22           1.7
X            0.5, 1.2
Current projects include but are not limited to those that focus on the following:
• the transcriptome: as complete as possible a description of what elements of the genome are actually
transcribed, with a classification, as far as possible, of putative function – at least whether the
transcript is likely to code for protein or non-protein-coding RNA.
• chromatin function and histone variants: genomic distribution of modifications to histones and other
chromosome-associated protein regulatory elements.
• the 3′ utr-ome: untranslated regions 3′ to coding sequences are important sites for
post-transcriptional regulation of expression.
• transcription factors: identification of the DNA-binding sites, and measurement of expression
patterns at different life stages.
• For D. melanogaster: a complete roster of small and microRNAs and assignment of their functions when
possible.
Note the focus on regulatory elements. Papers appearing in Science in late 2010 report modENCODE
results for worm and fly, respectively. These include reports of many new genes that encode proteins or
RNAs.
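As a consistency check on Table 4.10, the region sizes can be tallied and compared with the figures quoted for the ENCODE pilot (44 regions, about 30 Mb). The sketch below simply transcribes the table; the dictionary and function names are illustrative and not part of any ENCODE resource.

```python
# Region sizes in Mb, transcribed from Table 4.10 (names are illustrative).
ENCODE_REGIONS_MB = {
    "1": [0.5],
    "2": [0.5, 0.5, 0.5, 0.5],
    "4": [0.5],
    "5": [0.5, 0.5, 1.0],
    "6": [0.5, 0.5, 0.5, 0.5],
    "7": [0.5, 1.0, 1.1, 1.2, 1.9],
    "8": [0.5],
    "9": [0.5],
    "10": [0.5],
    "11": [0.5, 0.5, 0.6, 0.5, 1.0],
    "12": [0.5],
    "13": [0.5, 0.5],
    "14": [0.5, 0.5],
    "15": [0.5],
    "16": [0.5, 0.5, 0.5],
    "18": [0.5, 0.5],
    "19": [1.0],
    "20": [0.5],
    "21": [0.5, 1.7],
    "22": [1.7],
    "X": [0.5, 1.2],
}


def region_count(regions):
    """Total number of regions across all chromosomes."""
    return sum(len(sizes) for sizes in regions.values())


def total_megabases(regions):
    """Combined size of all regions in Mb, rounded to one decimal."""
    return round(sum(sum(sizes) for sizes in regions.values()), 1)


if __name__ == "__main__":
    print("regions:", region_count(ENCODE_REGIONS_MB))
    print("total size (Mb):", total_megabases(ENCODE_REGIONS_MB))
```

The tally agrees with the text: 44 regions totalling just under 30 Mb.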
● RECOMMENDED READING
• Papers treating some of the current debate in biological taxonomy and the strengths and
weaknesses of barcoding, and a description of the database:
Moritz, C. & Cicero, C. (2004). DNA barcoding: promise and pitfalls. PLoS Biol. 2, e354.
Ratnasingham, S. & Hebert, P.D.N. (2007). BOLD: The Barcode of Life Data System. Molecular
Ecology Notes 7, 355–364.
Stoeckle, M.Y. & Hebert, P.D. (2008). Barcode of life. Sci. Am. 299(4), 82–86, 88.
• Detailed review of work on genome comparisons and what they tell us about genome contents
and evolution:
Miller, W., Makova, K.D., Nekrutenko, A., & Hardison, R.C. (2004). Comparative genomics.
Annu. Rev. Genomics Hum. Genet. 5, 15–56.
• How modern developments in biological data collection affect the use of model organisms in the
study of human biology and disease:
Barr, M.M. (2003). Super models. Physiol. Genomics 13, 15–24.
• Importance of duplications:
Levasseur, A. & Pontarotti, P. (2011). The role of duplications in the evolution of genomes
highlights the need for evolutionary-based approaches in comparative genomics. Biol.
Direct 6, 11.
• Alternative splicing:
Park, J.W. & Graveley, B.R. (2007). Complex alternative splicing. Adv. Exp. Med. Biol. 623, 50–63.
• RNA editing:
Maas, S., Kawahara, Y., Tamburro, K.M., & Nishikura, K. (2006). A-to-I RNA editing and human
disease. RNA Biol. 3, 1–9.
Maas, S. (2010). Gene regulation through RNA editing. Discov. Med. 10, 379–386.
Farajollahi, S. & Maas, S. (2010). Molecular diversity through RNA editing: a balancing act.
Trends Genet. 26, 221–230.
• ENCODE and modENCODE:
The ENCODE Project Consortium (2011). A user’s guide to the Encyclopedia of DNA Elements
(ENCODE). PLoS Biol. 9, e1001046.
Elsner, M. & Mak, H.C. (2011). A modENCODE snapshot. Nat. Biotechnol. 29, 238–240.
Muers, M. (2011). Functional genomics: the modENCODE guide to the genome. Nat. Rev.
Genet. 12, 80.
Exercises
Exercise 4.1 On a photocopy of Figure 4.4, (a) indicate the position of the human genome;
(b) indicate the position of the pufferfish genome; (c) indicate the position of the rice (Oryza
sativa) genome; (d) indicate the position of the yeast S. cerevisiae genome; (e) indicate the
position of the E. coli genome; (f) indicate the range of sizes of mitochondrial genomes; and
(g) indicate the range of sizes of chloroplast genomes.
Exercise 4.2 In H.G. Wells’ 1898 novel The War of the Worlds, invaders from Mars are overcome
by disease. Assuming that life on Mars developed independently of life on Earth, why is it unlikely
that the Martians died of viral infections?
Exercise 4.3 Which of the following pairs are orthologues? Which are paralogues? Which are
neither? (a) Human trypsin and horse trypsin; (b) human trypsin and horse chymotrypsin;
(c) human trypsin and human elastase; (d) Bacillus subtilis subtilisin and horse chymotrypsin.
Exercise 4.4 On a photocopy of Figure 4.1, indicate estimates of the dates of the events at the
points marked by asterisks.
Exercise 4.5 What RNA molecule is most closely linked to the his operon in E. coli (see Figure 4.9)?
Exercise 4.6 Human mitochondrial DNA is 16 569 bp long. A brain cell may contain 10 000
mitochondria. What fraction of the DNA in a brain cell is mitochondrial? Assume 1 mtDNA/
mitochondrion.
Exercise 4.7 Some antibiotics, for example streptomycin, block protein synthesis in bacteria but
not cytoplasmic protein synthesis in eukaryotes. Mitochondria and chloroplasts contain their own
protein-synthesizing machinery. Would you expect streptomycin to block mitochondrial protein
synthesis? Explain your answer.
Exercise 4.8 On a photocopy of Figure 4.1, indicate by an arrow linking them the approximate
positions of the source and destination organisms for the endosymbiotic origins of mitochondria
and chloroplasts.
Exercise 4.9 Why could a mollusc that extracts chloroplasts from algae that it eats not simply let
the chloroplasts reside in some body cavity (like the symbiotic bacteria in the guts of ruminant
animals) rather than endocytosing them?
Exercise 4.10 The main E. coli chromosome contains 4 639 221 bp. The cell is roughly a cylinder
about 0.1 μm in diameter and 0.2 μm long. If the length of an extended segment of DNA in the B
conformation is 3.4 Å per base pair (0.34 nm), what would be the diameter of the chromosome if
it were geometrically a circle?
Exercise 4.11 On a photocopy of Figure 4.11, mark the positions that (a) contain the same amino
acid in all three eukaryotes but differ in S. aureus; (b) contain the same amino acid in humans and
chickens but are not the same in Neurospora and/or S. aureus; (c) contain the same amino acid in
humans and S. aureus, but this amino acid does not appear at this position in both of the other
two species.
Exercise 4.12 On a photocopy of Figure 4.11, indicate which regions of the amino acid sequences
are encoded by which blocks of the nucleotide sequences in Figure 4.12. (This exercise is similar to
Exercise 1.12.)
Exercise 4.13 On a photocopy of Figure 4.13, draw horizontal lines at the beginning and end of
the Devonian (see Figure 2.15). What is the earliest type of species to emerge after the split between
neuroglobin and cytoglobin?
Exercise 4.14 Which animals in Figure 4.14 have the largest number of functional β-globin genes?
Exercise 4.15 Which animals in Figure 4.14 show a triplication of part of the β-region? Which
genes make up the repeating unit?
Exercise 4.16 Which animals in Figure 4.14 show a pseudogene most closely related to a δ-globin?
Exercise 4.17 The human and chimpanzee genomes are 96% identical. (a) How many individual
bases of the human genome differ from the corresponding positions of the chimp genome?
(b) Assume that the human genome is 3% coding and contains about 20 000 genes, and that
all of the sequence differences are independent, single-base changes distributed randomly
throughout the genome (these assumptions are definitely not true). Estimate the fraction of
genes mutated between humans and chimpanzees.
Exercise 4.18 On a photocopy of Figure 4.1, indicate where three whole-genome duplications are
believed to have occurred.
Problems
Problem 4.1 On a copy of Figure 4.7(b), indicate the following interactions: (a) the positively
charged guanidino group in zanamivir (left in Figure 4.7b) forms salt bridges with Glu116 and
Glu225 in the neuraminidase active site; (b) hydroxyl groups of the glycerol moiety of zanamivir
(at the right in Figure 4.7b) are hydrogen bonded to Glu274; (c) the carbonyl oxygen of the
N-acetyl sidechain is hydrogen bonded to Arg149; (d) the methyl group of the N-acetyl sidechain
makes hydrophobic interactions with Ile220 and Trp176.
Problem 4.2 Read the letters to The New York Times (17 September 2005) discussing the decision
to publish the genome sequence of the 1918 pandemic strain of influenza virus. Summarize the
arguments for and against publication.
Problem 4.3 From the partial sequences in Figure 4.12, (a) how many positions (necessarily in the
coding regions) contain the same base in all three genomes? What percentage of positions contain
the same base in all three genomes? (b) How many positions in the coding regions are common to
human and chicken but different in S. aureus? To what percentage of coding positions does this
correspond? (c) How many positions in the non-coding regions are common to human and
chicken? To what percentage of non-coding positions does this correspond?
Problem 4.4 For the coding regions of the genes for human and chicken thioredoxins (p. 135),
calculate the Ka/Ks ratio. What can you infer about the relative importance of selection and drift in
accounting for the difference in the corresponding amino acid sequences?
Problem 4.5 To what time frame can you date the duplication of the γ gene in the β-globin locus
of primates? (See Figures 2.15, 4.13, and 4.14.)
Problem 4.6 Figure 4.24 shows an evolutionary tree of hominids. With reference to Figure 4.18,
can you find similarities in the chromosome structures that confirm these relationships between
human, common chimpanzee, gorilla, and orang-utan?
Figure 4.24 Evolutionary tree of hominids: pygmy chimp, common chimp, human, gorilla, orang-utan,
and gibbon, drawn against a time axis labelled from 25 to 0.
Problem 4.7 Which substitutions between rodent and human cytochrome c would not be
considered conservative mutations?
Problem 4.8 For the following pairs of homologous proteins, what are the percentages of
identical amino acids in an optimal alignment and what are the percentages of identical residues
or conservative substitutions in an optimal alignment: (a) the cytochrome c proteins of humans
and C. elegans; and (b) presenilin-1 homologues of humans and C. elegans?
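For questions such as Problem 4.8, once an alignment program has produced an optimal alignment, the two percentages can be counted directly from the aligned strings. The following Python fragment is a minimal sketch under stated assumptions, not a prescribed method: the function names are hypothetical, the grouping of amino acids into 'conservative' classes is only one of several groupings in common use, and the denominator (all aligned columns, including those with a gap in one sequence) is a convention that should be stated with any result.

```python
# Illustrative sketch: percentage identity and percentage "identity or
# conservative substitution" from a pairwise alignment.
# ASSUMPTION: this grouping of amino acids into conservative classes is one
# plausible choice, not a standard taken from the text.
CONSERVATIVE_GROUPS = [
    set("AVLIMC"),  # aliphatic / hydrophobic
    set("FWY"),     # aromatic
    set("ST"),      # small, hydroxyl-containing
    set("NQ"),      # amides
    set("DE"),      # acidic
    set("KRH"),     # basic
    set("G"),
    set("P"),
]


def is_conservative(a, b):
    """True if residues a and b fall in the same (assumed) class."""
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def alignment_stats(seq1, seq2, gap="-"):
    """Return (percent identical, percent identical-or-conservative).

    seq1 and seq2 are the two rows of an alignment, of equal length, with
    gaps written as `gap`. Columns where both rows are gaps are ignored;
    columns with a gap in one row count in the denominator only.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == gap or b == gap:
            continue
        if a == b:
            identical += 1
            similar += 1
        elif is_conservative(a, b):
            similar += 1
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    return 100.0 * identical / compared, 100.0 * similar / compared


# Made-up toy alignment, for illustration only (not real cytochrome c data)
identity, similarity = alignment_stats("ACDE", "ACEE")
```

In the toy alignment, three of the four columns are identical and the fourth (D against E) falls within the assumed acidic class. Real sequences would be retrieved from a database and aligned first; a different class definition or denominator will change the second percentage.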
Weblems
Weblem 4.1 Identify a virus and a prokaryote such that the genome of the virus is larger than the
genome of the prokaryote (see Figure 4.4).
Weblem 4.2 Find examples of viruses with (a) a double-stranded circular DNA genome;
(b) a double-stranded linear DNA genome; (c) a single-stranded (+)sense RNA genome; and
(d) a single-stranded (−)sense RNA genome.
Weblem 4.3 What was the most common serotype of influenza in the USA during the 2010–2011
season (www.cdc.gov/flu/weekly)?
Weblem 4.4 Give an example of a species that represents the simplest known form of (a) metazoan;
(b) deuterostome; (c) placoderm; and (d) eutherian.
Weblem 4.5 Mitochondrial DNA is often edited before translation. When a mitochondrial gene is
transferred to the nucleus, there are two possibilities: (1) the copying is DNA→DNA. In this case,
the nuclear version will initially have the mitochondrial DNA sequence and will require mutation
to effect the changes introduced in the mitochondrion by the editing; or (2) the nuclear gene is
reverse transcribed from the edited mitochondrial mRNA. Compare the sequences of the nuclear
gene for cytochrome oxidase II from mung bean (Vigna radiata) with mitochondrial sequences in
related legumes, before and after editing. Do the data support either hypothesis?
Weblem 4.6 Align the following sequences: (1) the nuclear-encoded rps14 gene from rice
(Oryza sativa); (2) the mitochondrial-encoded rps14 gene from broadbean (Vicia faba);
(3) the nuclear-encoded sdhB gene from rice. Identify the leader sequence in the rice rps14 gene,
targeting mitochondrial import, that appears to have been borrowed from the rice sdhB gene.
Weblem 4.7 Make two histograms of E. coli genes, similar to that of Figure 4.8, showing in one
of them the size distribution of genes appearing clockwise in the genome and in the other the
size distribution of genes appearing counter-clockwise in the genome. Describe any systematic
differences that appear.
Weblem 4.8 Many people believe that Rickettsiae are the closest extant relatives of the organism
that, after endosymbiosis, gave rise to mitochondria. Rickettsiae are obligate aerobes. (a) Can you
identify an enzyme that (1) catalyses an anaerobic function in mitochondria and (2) lacks a known
homologue in Rickettsiae? (b) If you can find one, and if the rickettsial origin of mitochondria is a
valid hypothesis, what are reasonable explanations of the presence of the anaerobic enzymatic
function in mitochondria? (c) How would you test these explanations?
Weblem 4.9 In eukaryotes, the recombination rate per kilobase – the physical distance that
corresponds to the genetic distance – varies among species by several orders of magnitude
overall. (It also varies within each genome.) It depends primarily on overall genome size. You
can find the data in computer-readable form through this book’s Online Resource Centre
(www.oxfordtextbooks.co.uk/orc/leskgenomics2e/). Draw graphs of recombination rate against
genome size, distinguishing data from different groups of organisms. (a) What relationship do
you observe, i.e. colloquially, what is the shape of the curve? (b) Can you plot the data in a way
that gives a linear relationship? (c) Do the data from the different groups of organisms follow
the same relationship? If not, how do they differ? (d) What general conclusions can you draw?
Weblem 4.10 Add to Figure 4.12 the corresponding partial gene sequence from N. crassa.
Describe its relationship to human, chicken, and S. aureus sequences. In particular, answer
questions analogous to those in Exercise 4.11 for the DNA sequences.
Weblem 4.11 What full-genome project is in progress, but not yet complete, for an organism in
each of the following categories: (a) fungus; (b) amphibian; (c) land plant; (d) insect (not a
species of Drosophila); (e) primate?
Weblem 4.12 Collect and align sequences of the protein HSP70 from about six (of each)
Gram-positive bacteria, proteobacteria, other Gram-negative bacteria, and archaea. Identify
an insertion common to proteobacteria and other Gram-negative bacteria but absent from
Gram-positive bacteria and archaea. On this basis, sketch the topology of a phylogenetic tree
relating Gram-positive bacteria, proteobacteria, other Gram-negative bacteria, and archaea.
Where do these results suggest placing the root of the prokaryotic tree?
Weblem 4.13 The genetic code used in translation of genes in animal mitochondria differs from
the standard one. Was this feature simply inherited from the original symbiont that gave rise
to the organelle, or did it arise subsequently by divergence? Determine (a) what genetic code is
used by the living organism that appears to be the closest extant relative of the original symbiont,
Rickettsia prowazekii. (b) Do all animal mitochondria use the same variant of the standard genetic
code? Based on these data, suggest an answer to the question.
Weblem 4.14 What is significant about the following species that justified sequencing their entire
genomes? (a) Plasmodium falciparum; (b) Aspergillus fumigatus; (c) Tropheryma
whipplei; (d) Sulfolobus tokodaii; (e) sea urchin; (f) Ciona intestinalis.
Weblem 4.15 In 1976, swine flu killed Private David Lewis, a soldier at Fort Dix, in central New
Jersey, south of Princeton, USA. What was the response of the US Government, under President
Gerald Ford? What was the course of the potential epidemic? In retrospect, does the response
appear to have been the right one? Could you have advised President Ford to make a different
response? If so, based on what facts known at the time of Private Lewis’s illness?
Weblem 4.16 Compute Ka/Ks ratios for the genes for the two isoforms of D. melanogaster
cytochrome c. Does the divergence appear to have arisen from selection or drift?
Weblem 4.17 Compute Ka/Ks ratios for the genes for human presenilin-1 and the C. elegans
homologue sel-12. Does the divergence appear to have arisen from selection or drift?
Weblem 4.18 The classification of tarsiers among primates is ambiguous. Palaeontology places
tarsiers with lemurs, lorises, and galagos, in the suborder prosimians. Some genetic relationships
place tarsiers with monkeys, apes, and humans. It is also possible that tarsiers form a separate
prosimian infraorder. Find the structure of the β-globin region of the tarsier genome and explain
what these data suggest about where, in Figure 4.15, the tarsiers belong.
Weblem 4.19 The human homologues of C. elegans NPR-1 are the neuropeptide Y receptors
NPY1R, NPY2R, . . . What mutations are known in these human receptors and what, according
to Online Mendelian Inheritance in Man (OMIM™), are their effects?
Weblem 4.20 In humans, cholesteryl ester transfer protein is important in controlling blood
levels of high-density lipoproteins. Do homologues of this protein exist in (a) mouse; (b) rat;
and (c) hamster?
Weblem 4.21 Mouse and rat cytochrome c have identical amino acid sequences. Do they have
identical gene sequences also? If not, show the differences between the cytochrome c genes in
mouse and rat. Distinguish between exons and introns.
Weblem 4.22 How many cytochrome c pseudogenes are there in the human genome?
Weblem 4.23 Which of the following genes have the same number of exons in human and
mouse orthologues? (a) Haemoglobin A (human gene HBA); (b) cytochrome c (human gene
HCS); (c) spermidine synthase (human gene SRM).
Weblem 4.24 Are any regions containing globin genes, other than haemoglobin α and β, included
in the choices of ENCODE regions studied by the HapMap project? If so, which ones?
CHAPTER 5
LEARNING GOALS
In the preceding chapter, we described some of the differences observed in comparing closely
related species, and distantly related ones, at various levels from molecules to karyotypes. In this
chapter we seek to understand how these changes came about. The general answer is of course
evolution. The availability of the data from genomics and related fields – such as protein structure
determinations – challenges us to probe into the detailed mechanism by which evolutionary
changes occur. Tools for analysing similarities include sequence-alignment algorithms, and, from
the similarities, methods for generating phylogenetic trees. These tools are part of the essential
skill set for anyone working in the field of genomics.
162 5 Evolution and Genomic Change
Evolution is exploration
Evolution is exploration. Exploration leads to discovery. Discovery leads to change. Change can
appear as creativity. (Let us avoid the word progress, much-abused in this context.) But it is all
based on exploration.

By exploration, we mean that life can probe the vicinity of its current state, generating and
testing variations. Evolution involves exploration and change at many levels. What makes biology
so complex is that the changes at different levels are intimately linked.

• The most fundamental level of exploration is mutation of genome sequences. It is through
  mutations and, in sexually reproducing organisms, allelic reassortment, that life explores the
  neighbourhood of a current genotypic and phenotypic state.
  A mutation can affect a transcribed molecule, changing either the amino acid sequence of a
  protein, or a base in a non-protein-coding RNA. Alternatively, changes in splice sites or
  regulatory sequences can change protein expression levels. A mutation that causes loss of an
  essential function will be lethal – a blind alley of evolutionary exploration. Conversely, other
  mutations might be expected to have little or no effect. Such putatively ‘silent mutations’
  include changes among synonymous triplets in protein-coding genes, or mutations in pseudogenes,
  or – presumably – mutations in the large portions of some genomes currently described as junk.
  However, even mutations producing synonymous codons in protein-coding genes could interact with
  transfer RNA (tRNA) levels to affect translation rates, and thereby affect protein structure
  (see pp. 9–10).
• Altered proteins explore possibilities of altered structure, including altered
  post-translational modification; and altered function, including changes in enzymatic activity
  or changes in regulatory signals.
  A single conservative amino acid substitution at a site on the surface of a protein distant from
  the active site would be expected to have only localized, self-contained effects on protein
  structure and function.
  However, some mutations send ripples through the system and have far-reaching effects. Small
  changes in single HOX genes have immense leverage in creating the overall body plans of animals.
  For example, a major change in body plan in metazoans occurred about 400 million years ago, with
  the emergence of insects with six legs from arthropod ancestors with large numbers of legs.
  Experiments of W. McGinnis and co-workers showed that changes in one protein, Ubx, a HOX
  homologue, are sufficient to achieve this large-scale anatomical transition.
• At the chromosomal level, evolution can explore distributions of genes. This can involve local
  or global gene duplication and transposition of either small segments of chromosomes or
  large-scale blocks. Degradation of synteny can lead to infertility. This is one of the
  mechanisms of speciation.
• At the cellular level, evolution has explored different kinds of organization, notably the
  prokaryote–eukaryote division. Some but not all cells have cell walls. Some, but not all, have
  chloroplasts. Complex organisms develop many types of specialized cells, tissues, and organs.
• Individuals explore different possible life histories. The context and interactions of our lives
  shape our development. For humans more than other species, cultural heritage and experience have
  a great effect on physical as well as on mental development. We as individuals also have greater
  control over how we explore the potential inherent in our genomes. This freedom is, of course,
  incomplete, for societies are ecosystems and constrain our development and activities.
• Within populations of individuals of the same species, evolution explores varying distributions
  of allele frequencies. Natural populations show genotypic and phenotypic variation. Even in the
  absence of mutations, populations can ‘react’ to changing conditions by varying gene
  frequencies: industrial melanism is a classic example (see Box 5.1).
• At the level of body plan, a visit to a zoo or botanical garden reveals life’s stunning variety.
  Comparative anatomy reveals the underlying similarities among different animals and plants, but
  also the different design solutions for structural, locomotory, and sensory systems between
  vertebrates and invertebrates.
• At the level of ecosystems, different populations explore their modes of interaction. Many pairs
  of species co-evolve. Some examples of co-evolution of species are known from cooperating
  species; for instance, the correlation between the anatomy of flowers and their insect
  pollinators – a subject studied by Darwin himself – or the correlation between colour changes in
  fruit ripening and the development of colour vision in animals. Other examples of co-evolution
  involve species in competition or conflict; for example, predator–prey relationships. These
  include the wars between humans and pathogenic bacteria and viruses.

Box 5.1

One variety of the British peppered moth, Biston betularia (variety typica), has a mottled
black-and-white colouring. The moths are nocturnal; during the day, they roost on tree trunks.
Before the rise of industrial pollution, light-coloured lichens encrusted the trees and the moths
were protected from birds by camouflage.

Another variety, carbonaria, has a uniformly dark colour. It was first observed, as a mutant, in
the mid-19th century, near Manchester in the north of England. Historical collections show the
rarity of the dark variety at that time.

The difference between the two varieties is controlled by a single gene that controls the amount
of the black pigment, melanin, produced.

As the industrial revolution advanced, soot killed the lichens and blackened the bark of the
trees. Against the darkened trees, the dark moths were better camouflaged. Within a century, the
population had become 90% variety carbonaria. Figure 5.1 shows, quite convincingly, the differences
in appearance of both trees and moths. The difference is a shift within the population of the
allelic frequency distribution of a single gene.

Let it not be said that England did not take steps to curb air pollution. Coal fires were banned
in London by Edward I in 1273, albeit only temporarily. Parliament followed up with the Clean Air
Act of 1956. At present, B. betularia populations in areas recovering from soot deposits are
shifting back to higher proportions of the mottled typica variety.

Figure 5.1 Industrial melanism. Left: light- and dark-coloured peppered moths (Biston betularia) on
normal trees, with lichens growing on the bark. Right: light- and dark-coloured peppered moths on
lichen-free trees encrusted with soot. Each picture contains two moths, one camouflaged and the
other easily visible. Can you spot the camouflaged moths?
Reproduced with permission from: Kettlewell, H.B.D. (1956). Further selection experiments on
industrial melanism in the Lepidoptera. Heredity 10, 287–301.

The mechanism of evolutionary change is now understood, at least in general terms. Genetic
reassortment and mutation generate inheritable phenotypic variation. Phenotype-dependent
differential rates of reproduction – that is, natural selection – govern which alleles, at which
frequencies, are passed on to succeeding generations. Alternatively, even in the absence of
selection, genetic drift can lead to alterations in genome contents and distributions in
populations.

The two elements of exploration – variation from the current state of the system and change to a
new one – occur at many levels, from individual genomes to proteins to cells to ecosystems. A
long-standing challenge of biology is to understand the relationships among these different levels
of evolution.
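The interplay of selection and drift described above can be illustrated with a small simulation. The following Python fragment is a sketch only, using the Wright–Fisher model, a standard textbook idealization that is not introduced in this chapter: each generation, allele frequencies are first reweighted by relative fitness (selection) and then resampled at random in a finite population (drift). The function names, the haploid treatment, and all numerical values, including the fitness advantage assigned to one allele, are assumptions chosen for illustration and are not measurements from moth populations.

```python
import random

# Illustrative sketch of a haploid Wright-Fisher model with two alleles, A and B.
# ASSUMPTIONS: function names, fitness values, and population sizes are
# hypothetical and chosen only to illustrate selection and drift.


def next_generation(p, n_alleles, fitness_a, fitness_b, rng):
    """Return the frequency of allele A in the next generation.

    p          -- current frequency of allele A (0..1)
    n_alleles  -- number of gene copies in the population
    fitness_a, fitness_b -- relative fitnesses of the two alleles
    rng        -- a random.Random instance
    """
    # Selection: reweight the frequency by relative fitness.
    mean_fitness = p * fitness_a + (1.0 - p) * fitness_b
    p_selected = p * fitness_a / mean_fitness
    # Drift: draw n_alleles copies at random from the selected pool.
    count_a = sum(1 for _ in range(n_alleles) if rng.random() < p_selected)
    return count_a / n_alleles


def simulate(p0, n_alleles, generations, fitness_a=1.0, fitness_b=1.0, seed=0):
    """Return the list of allele-A frequencies, starting with p0."""
    rng = random.Random(seed)
    trajectory = [p0]
    p = p0
    for _ in range(generations):
        p = next_generation(p, n_alleles, fitness_a, fitness_b, rng)
        trajectory.append(p)
    return trajectory


# Selection: a rare allele with an assumed fitness advantage rises in frequency.
selected = simulate(0.1, 1000, 100, fitness_a=1.5, fitness_b=1.0, seed=1)

# Drift alone: equal fitnesses in a small population; the frequency wanders.
neutral = simulate(0.5, 50, 100, seed=2)
```

With equal fitnesses the only source of change is sampling, so the smaller the population, the larger the random fluctuations from one generation to the next. Reversing the fitness values partway through a run would mimic, very crudely, a change in conditions such as the return of lichens to tree bark.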
Biological systematics
Classically, the unit of large-scale evolution is the species. Species represent nature’s
experiments in structures and lifestyles. Both Linnaeus and Darwin recognized the importance of
species, and made them the focus of their work. The concept of species remains essential, despite
its attendant difficulties (see p. 66). In view of the theoretical problems, biologists present
the analysis of known species, and higher-order taxa, in terms based on tradition and convention.
We shall explore how modern analytic methods based on genomic and structural data mesh with the
classical approaches.

Study of the vast variety of living organisms requires that we organize what we observe and
measure. We have to agree on what we call things. Biological taxonomy encompasses identifying new
life forms (new in the sense of new to the scientific literature), deciding where they fit in, and
assigning them a name – based on some ‘real or fancied characteristic of the form described’
(A.S. Romer) and equipped with a proper description and deposition of specimen material.

Biological nomenclature

Two problems in organizing biological nomenclature are what to name and how to assign the names.

The taxonomic hierarchy – kingdom, phylum, class, order, family, genus, and species – introduced
by Linnaeus is still in use, although with modifications. (Another legacy is the continued use of
classical languages.) Members of more restrictive categories, for instance, several species in the
same genus, have more shared features and higher degrees of similarity than do members of less
restrictive categories.

Descriptions of new species are more common than descriptions of new genera, families, etc. During
the years 1970–1998, five times as many new species and subspecies were described as new genera.
New families, orders, classes, and phyla appear much more infrequently (Table 5.1).

Table 5.1 New taxa described from 1970 to 1998

Taxa            Numbers described
New phyla       11
New classes     44
New orders      100
New families    731
New genera      8 579

Biological names usually describe features of a species (e.g. giant kangaroo, Macropus giganteus,
which means large-foot gigantic, not to be confused with the ‘bigfoot’ primate alleged to inhabit
the northwest USA and adjacent regions of Canada). Names may indicate location (e.g. Virginia
opossum, Didelphis virginiana), or recognize the discoverer (e.g. Darwin’s rhea, Rhea darwinii, a
large bird encountered by Darwin on his visit to South America). J. Gould named the species in
Darwin’s honour in 1837. (P.H.G. Möhring had named the genus in 1752, to reflect the large size of
the birds: the Greek goddess Rhea, mother of Zeus, was a female titan, member of a mythological
race of giants.) The scientific name of Père David’s deer, Elaphurus davidianus, is another example
of an animal named after its discoverer.

Other sources of names include expedition sponsors, thesis supervisors, and public figures such as
kings and queens, politicians, artists, musicians, and sporting figures. A marine mollusc,
Rotaovula hirohitoi, was named for the former Emperor of Japan, who was himself a serious marine
biologist. Some scientists have named creatures with loathsome features after rivals, as insults.
Finally, J.E. Winston has written: ‘These days, it would be considered pretty tacky . . . to name a
species after yourself’.* For unusual and, in some cases, amusing examples, see
http://home.earthlink.net/~misaak/taxonomy.html, http://cache.ucr.edu/~heraty/menke.html and
http://cache.ucr.edu/~heraty/yanega.html#ECOLOGY.

* Winston, J.E. (1999). Describing Species. Columbia University Press, New York, p. 165.

Biological nomenclature is governed by international agreements, adopted by consent by
professional scientists. The International Codes of Zoological and Botanical Nomenclature
separately offer rules for the naming of animals and plants. Nomenclature of bacteria grew out of,
and eventually split off from, the botanical code. Virologists have developed their own
classification. Currently the International Union of Biological Sciences, a member of the
International Council of Scientific Unions, concerns itself with biological nomenclature. It is a
sponsor of the Species 2000 project, an effort to curate a complete and integrated database of the
world’s species, including plants, animals, fungi, and microbes. The Species 2000 project
coordinates its activities with other similar international efforts, including the Interagency
Taxonomic Information System (ITIS) and the Global Biodiversity Information Facility (GBIF). The
related projects of the Tree of Life (http://tolweb.org/tree/) and ARKive (http://www.arkive.org/)
include pictorial databases.

The World Conservation Union, usually known by its former name, the International Union for the
Conservation of Nature and Natural Resources (IUCN), maintains a ‘Red list’ of endangered species.

Measurement of biological similarities and differences

Ultimately, comparisons in biology involve observations of differences between individual
organisms. With access to living populations, it is possible to get a sense of the variability
among individuals and to observe a variety of features, including physiology and lifestyle, in
addition to ‘static’ anatomy.

In contrast, for extinct species, palaeontologists are often limited to fragmentary samples of
hard parts – bones and teeth – sometimes from a single individual. Indeed, in the classical era of
natural history exploration, a museum in Europe would often receive only a preserved body, even
from species that still thrived elsewhere in the world. (Père David’s deer is a typical example.)
Despite these handicaps in data collection, biologists in the pre-molecular era built their
taxonomic edifice on studies of comparative anatomy, embryology, and stratigraphy for dates. They
developed spectacular expertise: it was said that Cuvier, the founder of vertebrate palaeontology,
could reconstruct the entire skeleton of an animal from a single bone.

Understanding biological diversity requires observation and measurement of similarities and
differences. What features should one compare? Classically, the choice depended on expertise and
experience. W.E. Le Gros Clark wrote,

  While it may be broadly accepted that, as a general proposition, degrees of genetic relationship
  can be assessed by noting degrees of resemblance in anatomical details, it needs to be
  emphasized that morphological characters vary considerably in their significance for this
  assessment. Consequently it is of the utmost importance that particular attention should be
  given to those characters whose taxonomic relevance has been duly established by comparative
  anatomical and palaeontological studies.
  – Le Gros Clark, W.E. (1971). The Antecedents of Man, 3rd edn. Quadrangle Books, Chicago,
  pp. 11–12.

This approach works fine in the hands of a professional as expert and distinguished as Le Gros
Clark, but it has also elicited attempts to make classification methods more quantitative and
objective. These attempts include the development of computational methods for interpreting
similarities of a wide spectrum of features, some but not all based on sequence data.

Molecular techniques

Many molecular properties have been used for phylogenetic studies, some surprisingly long ago.
Products of evolution retain similarities. The similarities appear at many levels – related
people, recently diverged species, tissues within an organism containing related cell types but
varying protein expression patterns, amino acid sequences and structures of proteins, and DNA
sequences. A major theme of biology traditionally has been to recognize and classify such
similarities, with a view to understanding how they arose and, when appropriate, to what purpose
(i.e. with what selective advantage).
To trace the course of evolution, we must quantitatively measure such similarities. There are many
possible objects of such analysis – sequences of individual genes, full-genome sequences,
sequences and structures of proteins, anatomical features, patterns of development, and any other
phenotypic character one might choose. In many cases, the patterns of similarity between different
features of a set of species give corresponding results, bolstering our confidence in their
significance.
However, it is necessary to keep clearly in mind that similarity, which is observable, is a
surrogate for relationship, which usually is not. Related biological objects are homologues, or
families. In many cases,
Pattern matching – the basic tool of bioinformatics 167
such as the globins, the similarities are sufficient to give us confidence that we are analysing a family of related molecules. Ideally, we have a spectrum of similarities, including close relatives and some distant ones, with the distant relatives linked by chains of close ones. For the comparisons of globins from different species, the congruence of the degrees of similarity of the molecules with measurements of similarities of other sets of molecules and with the classical taxonomic relationships between species is reassuring. For globins within a single species, the common conserved features argue that we are dealing with a single diverged family.
But divergence does not stop within the scope of our ability to detect homologues, and there are many cases where (a) a tantalizing tenuous degree of similarity suggests homology, but we remain unsure whether or not the inference of relationship is valid, or (b) there is sufficient dissimilarity between two molecules or structures that homology is unsuspected, but a series of missing links clearly connects the two.
Our most precise tools measure similarity between sequences or between molecular structures. These tools are mature methods. They have been calibrated to allow us to decide, in all but the hardest cases, whether or not we are dealing with homologues.
(b) 10 20 30 40 50 60
| | | | | |
Human LSLEPVYWNSANKRFQAEGGYVLYPQIGDRLDLLCPRARPPGPHSSPNYEFYKLYLVGGA
Xenopus SLDPIYWNSSNKRFEDTEGYVLYPQIGDRLDLLCPRSEPQGPFSSSPYEYYKLYLVGTK
SL P YWNS NKRF GYVLYPQIGDRLDLLCPR P GP SS YE YKLYLVG
Human PFCPHYE_
Xenopus PANIYYKV
P Y
Figure 5.5 Relationships between the sequences of ephrin B3 proteins from human and Xenopus laevis. (a) Dot plot. The major
signal is along the main diagonal, interrupted by occasional divergent regions, and showing the substantially weaker similarity near
the C terminus. (b) Sequence alignment. Amino acids are colour coded by physicochemical type. Letters under the sequences indicate
positions occupied by the same residue in both sequences.
Figure 5.6 A path through the Dorothy Hodgkin dot plot. Diagonal arrows correspond to aligned residues, horizontal arrows to gap insertions. The corresponding alignment is the obvious one:

DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN

Multiple sequence alignment reveals the underlying patterns contained in a set of related sequences much more clearly than pairwise sequence alignments.
Programs for all of these different alignment problems are available on the web:

Global alignment (pairwise and multiple)
CLUSTAL W    http://www.ebi.ac.uk/clustalw
T-Coffee     http://www.igs.cnrs-mrs.fr/Tcoffee/tcoffeecgi/index.cgi
EMBOSS       http://www.ebi.ac.uk/emboss/align

Local alignment
SSEARCH      http://pir.georgetown.edu/pirwww/search/pairwise.shtml
EMBOSS       http://www.ebi.ac.uk/emboss/align/

Defining the optimum alignment
To go beyond ‘alignment by eyeball’ via dot plots, we must define quantitative measures of sequence similarity and difference.
Given two character strings, two measures of the distance between them are as follows:
• the Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters.
• the Levenshtein, or edit, distance, defined between two strings not necessarily of equal length, is the minimal number of ‘edit operations’ required to change one string into the other, where an edit operation is a deletion, insertion, or alteration of a single character in either sequence.

For example:

agtc      Hamming distance = 2
cgta

ag-tcc    Levenshtein distance = 3
cgctca

A given sequence of edit operations induces a unique alignment, but not vice versa.
For applications to molecular biology, we wish to assign variable weights to different edit operations. For nucleic acids, we know that transition mutations (purine↔purine and pyrimidine↔pyrimidine; i.e. A↔G and T↔C) are more common than transversions (purine↔pyrimidine; i.e. (A or G)↔(T or C)). For proteins, amino acid substitutions tend to be conservative: the replacement of one amino acid by another with similar size or physicochemical properties is more likely to occur than its replacement by another amino acid with dissimilar properties. Similarly, the deletion of several contiguous bases or amino acids is more probable than the independent deletion of the same number of isolated bases.
A computer program can score each path through the dot plot by adding up the scores of the individual steps. For each substitution, it adds the score of the mutation, depending on the pair of residues involved. For horizontal and vertical moves, it adds a suitable gap penalty.

Scoring schemes
A scoring system must account for residue substitutions and insertions or deletions. (An insertion from one sequence’s point of view is a deletion as seen by the other.) Deletions, or gaps in a sequence, will have scores that depend on their lengths.
For nucleic acid sequences, it is common to use a simple scheme for substitutions, +1 for a match, −1 for a mismatch, or a more complicated scheme based on the higher frequency of transition mutations than transversion mutations. One possibility is:
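As a minimal sketch of the ideas in this subsection, the following Python code computes the Hamming and Levenshtein distances and one transition-aware substitution score. The function names are hypothetical, and the weights (+1 for a match, −1 for a transition, −2 for a transversion) are assumptions chosen only for illustration; they are not values taken from the text.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimal number of single-character insertions, deletions, or alterations."""
    prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # keep or alter
        prev = curr
    return prev[-1]


PURINES = set("ag")  # the remaining bases, c and t, are pyrimidines


def substitution_score(x: str, y: str,
                       match: int = 1,
                       transition: int = -1,
                       transversion: int = -2) -> int:
    """Illustrative nucleotide score; the default weights are assumptions."""
    x, y = x.lower(), y.lower()
    if x == y:
        return match
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


print(hamming("agtc", "cgta"))          # 2
print(levenshtein("agtcc", "cgctca"))   # 3
```

The two printed values reproduce the example distances given above. Penalizing transversions more heavily than transitions is one way to reflect their lower observed frequency; any real application would need weights calibrated on data.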
Box 5.2 The BLOSUM62 matrix used for scoring amino acid sequence similarity
Rows and columns are in alphabetical order of the three-letter amino acid names. Only the lower triangle of the matrix is shown, as the substitution probabilities are taken as symmetric (not because we are sure that the rate of any substitution is the same as the rate of its reverse, but because it is difficult to determine the differences between the two rates).
Ala (A) 4
Arg (R) −1 5
Asn (N) −2 0 6
Asp (D) −2 −2 1 6
Cys (C) 0 −3 −3 −3 9
Gln (Q) −1 1 0 0 −3 5
Glu (E) −1 0 0 2 −4 2 5
Gly (G) 0 −2 0 −1 −3 −2 −2 6
His (H) −2 0 1 −1 −3 0 0 −2 8
Ile (I) −1 −3 −3 −3 −1 −3 −3 −4 −3 4
Leu (L) −1 −2 −3 −4 −1 −2 −3 −4 −3 2 4
Lys (K) −1 2 0 −1 −3 1 1 −2 −1 −3 −2 5
Met (M) −1 −1 −2 −3 −1 0 −2 −3 −2 1 2 −1 5
Phe (F) −2 −3 −3 −3 −2 −3 −3 −3 −1 0 0 −3 0 6
Pro (P) −1 −2 −2 −1 −3 −1 −1 −2 −2 −3 −3 −1 −2 −4 7
Ser (S) 1 −1 1 0 −1 0 0 0 −1 −2 −2 0 −1 −2 −1 4
Thr (T) 0 −1 0 −1 −1 −1 −1 −2 −2 −1 −1 −1 −1 −2 −1 1 5
Trp (W) −3 −3 −4 −4 −2 −2 −3 −2 −2 −3 −2 −3 −1 1 −4 −3 −2 11
Tyr (Y) −2 −2 −2 −3 −2 −1 −2 −3 2 −1 −1 −2 −1 3 −3 −2 −2 2 7
Val (V) 0 −3 −3 −3 −1 −2 −2 −3 −3 3 1 −2 1 −1 −2 −2 0 −3 −1 4
A R N D C Q E G H I L K M F P S T W Y V
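The entries of this matrix are scaled log-odds scores. The following sketch converts an entry back to a relative frequency, assuming the convention described in this section (entry = 10 × log10 of the ratio of observed to expected substitution frequency). The function names are hypothetical, and `scale` and `base` are left as parameters because other distributions of the matrix may use a different scale factor or logarithm base.

```python
import math


def odds_ratio(entry: float, scale: float = 10.0, base: float = 10.0) -> float:
    """Relative frequency (observed / expected) implied by a matrix entry."""
    return base ** (entry / scale)


def entry_from_ratio(ratio: float, scale: float = 10.0, base: float = 10.0) -> int:
    """Scaled, rounded log-odds score for a given observed/expected ratio."""
    return round(scale * math.log(ratio, base))


# Leucine <-> isoleucine has the entry +2 in the matrix above.
print(round(odds_ratio(2), 1))    # 1.6
print(entry_from_ratio(1.6))      # 2

# Scores add where the underlying ratios multiply.
print(math.isclose(odds_ratio(2 + 4), odds_ratio(2) * odds_ratio(4)))  # True
```

Working with logarithms means that the score of an alignment is a simple sum over its columns, rather than a product of many small probabilities.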
calculated the ratio of the number of observed pairs of amino acids at any position to the number of pairs expected from the overall amino acid frequencies.
In order to avoid overweighting closely related sequences, the Henikoffs replaced groups of proteins that had sequence identities higher than a threshold by either a single representative or a weighted average. The threshold of 62% identity produces the commonly used BLOSUM62 substitution matrix. This is offered by all programs as an option and is the default in most.
The BLOSUM62 matrix is shown in Box 5.2. It expresses scores as log-odds values:

$$\text{score of mutation } i \leftrightarrow j = \log_{10}\left(\frac{\text{observed } i \leftrightarrow j \text{ mutation rate}}{\text{mutation rate expected from amino acid frequencies}}\right)$$

The numbers are multiplied by 10, to avoid decimal points. The matrix entries reflect the probabilities of mutational events. A value of +2 (e.g. leucine↔isoleucine) implies that in related sequences the mutation would be expected to occur 1.6 times more frequently than random. The calculation is as follows: the matrix entry 2 corresponds to the actual value 0.2 because of the scaling. The value 0.2 is log10 of the relative expectation value of the mutation. As log10(1.6) = 0.2, the expectation value is 1.6.
The probability of two independent mutational events is the product of their probabilities. By using logarithms, we have scores that we can add up rather than multiply, a computational convenience.

Scoring insertions and deletions, or ‘gap weighting’
To form a complete scoring scheme for alignments, we need, in addition to the substitution matrix, a way of scoring gaps. How important are insertions and deletions, relative to substitutions? We need to distinguish gap initiation:

aaagaaa
aaa-aaa

from gap extension:

aaaggggaaa
aaa----aaa

For aligning DNA sequences, the popular alignment software package CLUSTALW recommends use of the identity matrix for substitution (+1 for a match, 0 for a mismatch) and gap penalties of 10 for gap initiation and 0.1 for gap extension by one residue. For aligning protein sequences, the recommendations are to use the BLOSUM62 matrix for substitutions, with gap penalties of 11 for gap initiation and 1 for gap extension by one residue.

• To define optimal alignment, we must assign scores for each possible substitution and corresponding scores for gap initiation and extension.

Approximate methods for quick screening of databases
It is routine to screen genes from a new genome against databases, to find similarities to other sequences. Databases have grown so large that programs based on exact local alignments are too slow. Approximate methods can detect close relationships well and quickly but are inferior to the exact ones in picking up very distant relationships. In practice, they give satisfactory performance in the many cases in which the probe sequence is fairly similar to one or more sequences in a databank, and they are, therefore, certainly worth trying first.

The original paper on BLAST: Altschul, S.F., et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410, was the field’s most highly cited paper published in the 1990s.

A typical approximation approach such as BLAST (basic local alignment search tool) takes a small integer k and determines all instances of each ‘word’ of length k (i.e. each set of k consecutive characters, with no gaps) of the probe sequence that occur in any sequence in the database. A candidate sequence is a sequence in the databank containing a large number of matching k-tuples, with equivalent spacing in probe and candidate sequences. For a selected set of candidate sequences, approximate optimal alignment calculations are then carried out, with the time- and space-saving restriction that the paths through the matrix considered are restricted to bands around the diagonals containing many matching k-tuples. It is clearest to show the procedure in terms of a dot plot (see Figure 5.7).
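The scoring ingredients described earlier in this section, a substitution score plus separate penalties for gap initiation and gap extension, can be combined as in the following sketch. The function name is hypothetical. One assumption to note: the first position of a gap is charged the initiation penalty and each further position the extension penalty; alignment programs differ on this detail.

```python
def score_alignment(top, bottom, substitution, gap_open, gap_extend):
    """Score two aligned rows of equal length in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    open_gap = None                      # which row currently has an open gap
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # Assumed convention: initiation for the first gap position,
            # extension for every following position of the same gap.
            score -= gap_extend if open_gap == row else gap_open
            open_gap = row
        else:
            score += substitution(x, y)
            open_gap = None
    return score


def identity(x, y):
    """Identity matrix: +1 for a match, 0 for a mismatch."""
    return 1 if x == y else 0


# The two gap examples from the text, scored with the DNA settings quoted above.
print(round(score_alignment("aaagaaa", "aaa-aaa", identity, 10, 0.1), 1))        # -4.0
print(round(score_alignment("aaaggggaaa", "aaa----aaa", identity, 10, 0.1), 1))  # -4.3
```

With these penalties, one gap of four residues costs far less than four separate one-residue gaps, which matches the observation that contiguous deletions are more probable than independent ones.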
[Figure 5.7: four dot-plot panels, with axes labelled ‘Database to be searched’ and ‘Probe sequence’: (1) Empty dot plot; (2) Word lookup; (3) Match extension; (4) Local gapped alignment.]
Figure 5.7 Schematic diagram showing the mechanism of a BLAST search. BLAST solves the problem of finding matches of a probe
sequence in a full genome or a full database that are much longer than the probe sequence.
(1) The ‘playing field’ of the algorithm is the outline of a dot plot, just as if the problem were going to be solved by application of an
exact-alignment method.
(2) BLAST first divides the probe sequence into fixed-length words of length k; here k = 4. It then identifies all exact occurrences of
these words in the full database, with no mismatches or gaps. Note that the same four-letter word may occur several times in the
probe sequence (shown here in red), and of course each four-letter word may match many times within the database. It is possible
to do this step quickly after pre-processing the database to record the sites of appearance of all four-letter words.
(3) Starting with each match, BLAST tries to extend the match in both directions, still with no mismatches or gaps allowed.
(4) Given the extended matches, BLAST tries to put them together by doing alignments allowing mismatches and gaps, but only within
limited regions containing the preliminary matches (grey areas). The result of this step is to add to the matches the positions shown
as ★. This produces longer matching regions.
It is the restriction of the more complex matching procedure to relatively small regions, rather than applying it to the entire matrix, that
gives the method its speed. The price to pay is that the method will miss a combined match lying outside the grey area. In the example
illustrated, the matching regions coloured red and green, at the right of the matrix, will not be combined but reported as separate hits.
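Steps (2) and (3) of the caption can be sketched in a few lines of Python. This is a simplified illustration that follows the scheme in the caption (exact word matches, then extension with no mismatches or gaps). The function names and the example sequences are hypothetical, and step (4), the restricted gapped alignment, is not implemented.

```python
from collections import defaultdict


def index_words(database: str, k: int) -> dict:
    """Pre-process the database: record every position of every k-letter word."""
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    return index


def seed_hits(probe: str, index: dict, k: int) -> list:
    """Step (2): all (probe_pos, database_pos) pairs with an exact word match."""
    return [(i, j)
            for i in range(len(probe) - k + 1)
            for j in index.get(probe[i:i + k], [])]


def extend(probe: str, database: str, i: int, j: int, k: int) -> tuple:
    """Step (3): extend a word match in both directions, no mismatches or gaps."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and probe[start_i - 1] == database[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + k, j + k
    while end_i < len(probe) and end_j < len(database) and probe[end_i] == database[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i   # probe start, database start, length


def ungapped_matches(probe: str, database: str, k: int = 4) -> list:
    """Distinct extended matches, sorted by position."""
    index = index_words(database, k)
    return sorted({extend(probe, database, i, j, k)
                   for i, j in seed_hits(probe, index, k)})


print(ungapped_matches("GATTACA", "TTGATTACAGGATTAC"))
# [(0, 2, 7), (0, 10, 6)]
```

Several overlapping words collapse into the same extended match, which is why the result is collected in a set. Building the word index once and reusing it for every probe is what makes the lookup step fast.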
There are several variations on this theme, including the original BLAST program and its variants (see Box 5.3).

Multiple sequence alignments and pattern detection
Multiple sequence alignments are rich in information about patterns of conservation. They help us to understand the common features of structure and function of a family of sequences, by showing us which residues are crucial (and therefore conserved). They also help us to identify distant homologues with greater confidence than a pairwise sequence alignment could.
The patterns inherent in a multiple sequence alignment are not merely inferences from the alignment table – this is leaving it too late – but can actively contribute to creating a high-quality alignment. The idea is for an algorithm to learn the underlying patterns while it is assembling the multiple sequence alignment.
Box 5.3 Different ‘flavours’ of BLAST search different databases
Program    Searches for:    In:
[Flow-chart labels: Input sequence; Protein sequence databank; BLAST search; Filter results, E < threshold]
generates sequences of nucleotides, or amino acids, according to rules that govern the probability distributions of successors to each letter. For example, an adenine would be assigned a set of probabilities for being followed by another adenine, or a thymine, or a cytosine, or a guanine, or a gap. A different set of probabilities would govern the successors of a thymine, a guanine, a cytosine, or a gap. The enhanced power of hidden Markov models over position-specific scoring matrices stems from the correlation between successive positions.
For example, it is observed that the dinucleotide frequency CpG is lower in higher organisms than would be expected from the overall mole fractions of C and G in the genome. A hidden Markov model would reflect this in a lowered probability of G in a position following a C, relative to the probabilities of a G following A, T, or another G.
Given a set of sequences, the process of training a hidden Markov model involves adjusting all of the probability distributions so that the sequences generated by the model have a high probability of reproducing the set of sequences analysed.

• A multiple sequence alignment is much richer in information than a pairwise sequence alignment. A hidden Markov model is a method for capturing the information.

Pattern matching in three-dimensional structures
Given two or more structures – perhaps of several homologous proteins – we can frame questions generalized from sequence alignment. How can we measure structural similarity quantitatively? Can we derive a sequence alignment from structural comparisons?
In the native state of a protein, the mainchain follows a curve in space. The general spatial layout of this curve defines a folding pattern. The backbones of related proteins show recognizably similar but not identical folding patterns. A letter of the alphabet in different type fonts – for instance, b and b – illustrates the topological similarities, and differences, in detail seen among proteins related by evolutionary divergence that share a common folding pattern. A better analogy for widely divergent proteins might be the letters B and R, which share the letter P as a common core substructure but in addition have either a loop (B) or a stroke (R) that differ. Homologous protein structures typically contain fairly large, well-fitting substructures. Figure 5.9 shows a superposition of local regions of two proteins and an overall superposition of two entire structures.
Extraction of the maximum common substructure induces an alignment of the sequences. This is called a structural alignment. Because structure changes more conservatively than sequence during evolution, for distantly related proteins it may be possible to align the sequences on the basis of the structures even if methods based purely on sequences cannot recognize the relationship.

• A structure alignment is nevertheless an alignment, that is, an assignment of residue–residue correspondences. Instead of assigning the correspondence by matching the characters in two or more sequences, a structural alignment assigns the correspondence to residues that occupy similar positions in space, relative to the molecular framework.
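One way to put a number on structural similarity, once residue–residue correspondences are given, is to compare the internal distances of the two structures. The sketch below computes a distance-based root-mean-square deviation; this particular measure is an assumption chosen for illustration and is not named in the text. The function name and the coordinates are hypothetical.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b, pairs):
    """RMS difference of intramolecular distances over aligned residue pairs.

    coords_a, coords_b: lists of (x, y, z) positions, one per residue.
    pairs: list of (index_in_a, index_in_b) correspondences (the alignment).
    """
    if len(pairs) < 2:
        raise ValueError("need at least two aligned residues")
    total, count = 0.0, 0
    for (i, j), (k, l) in combinations(pairs, 2):
        d_a = math.dist(coords_a[i], coords_a[k])   # distance within structure A
        d_b = math.dist(coords_b[j], coords_b[l])   # matching distance within B
        total += (d_a - d_b) ** 2
        count += 1
    return math.sqrt(total / count)


straight = [(0, 0, 0), (4, 0, 0), (8, 0, 0)]
shifted = [(1, 2, 3), (5, 2, 3), (9, 2, 3)]      # same shape, translated
bent = [(0, 0, 0), (4, 0, 0), (4, 4, 0)]         # last residue moved
alignment = [(0, 0), (1, 1), (2, 2)]

print(distance_rmsd(straight, shifted, alignment))            # 0.0
print(round(distance_rmsd(straight, bent, alignment), 2))     # 1.35
```

Because only internal distances are compared, the value does not depend on how the two structures are positioned in space, so no superposition step is needed. The price is that such a measure cannot distinguish a structure from its mirror image.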
Extending Crick’s classic ‘central dogma’ gives us the basic paradigm:

DNA → RNA → amino acid sequence of a protein → protein structure → protein function

During evolution, selection acts on protein function to alter gene frequencies in populations, closing the loop back to DNA.
Transcription of DNA to RNA, and translation of mRNA by ribosomes, takes us as far as the amino acid sequence. The amino acid sequence dictates the protein structure by a spontaneous folding process (see Chapter 1). Folding produces a native state. For most proteins, the native state structure contains an active site with the proper geometry, charge distribution, and hydrogen-bonding potential to interact specifically
Evolution of protein sequences, structures, and functions 177
with other proteins, or with small-molecule ligands. In many cases, the active site contains catalytic residues that produce enzymatic activity.

The effects of single-site mutations
The native states of proteins are the cumulative effect of many inter-residue interactions. What then should we expect to be the result of a perturbation in the amino acid sequence?
Consider a SNP leading to a single amino acid substitution. Will the structure stay the same? Changing one amino acid without otherwise altering the structure would leave most interactions intact, except for those involving the mutated residue itself (and conservative mutations may preserve even these). Nevertheless, sometimes changing a single residue is enough to blow the original structure apart. An example is the mutation A174D in human aldolase (see Box 5.4). In other cases changes to residues providing specific interactions with ligands may alter activity. Some mutations do not alter the structure but destabilize it; frequently this is enough to cause disease. Box 5.4, treating human
The enzyme aldolase catalyses the cleavage of two substrates:

fructose-1,6-bisphosphate → glyceraldehyde-3-phosphate + dihydroxyacetone phosphate
fructose-1-phosphate → glyceraldehyde + dihydroxyacetone phosphate

Fructose-1,6-bisphosphate is a mainstream metabolite in glycolysis and gluconeogenesis, classic pathways of glucose metabolism. Fructose-1-phosphate arises in metabolism of dietary fructose. Different isozymes of aldolase have different relative activities towards fructose-1,6-bisphosphate and fructose-1-phosphate.
Approximately 1 in 20 000 people suffer from hereditary fructose intolerance, a defect in the liver isozyme, aldolase B. The gene encoding this protein maps to locus 9q22.3 in the human, giving the trait an autosomal recessive inheritance pattern. For affected individuals, ingestion of fructose, a monosaccharide common in fruits and honey, leads to vomiting, discomfort, and hypoglycaemia. Problems often first appear in infancy as fructose and sucrose are added to the diet upon weaning. The condition can be fatal if unrecognized and untreated; however, for most patients it is sufficient to adopt a diet free of fructose and sucrose.
Numerous mutations have been associated with aldolase B dysfunction, including amino acid substitutions, nonsense mutations producing truncated protein, insertions and deletions including frameshifts, and changes in splice sites. The most common mutations are A149P (over 50% of cases worldwide) and A174D.
T. Cox and co-workers* characterized the proteins corresponding to several known mutants. Normal aldolase B is a tetramer of four 363-amino-acid subunits. Because all mutants were discovered in patients presenting with Hereditary Fructose Intolerance, all had reduced or absent enzymatic activity. Two classes of mutants were:
• Catalytic mutants: these can be expressed as intact tetramers, retaining some activity at 37°C. These include W147R and R303W.
• Structural mutants: these are destabilized, and show catalytic activity (if at all) only after expression at 22–23°C. These include N334K, A149P, L256P, and A174D.
The substitutions in the catalytic mutants occur in or near the active site. The substitutions in the structural mutants occur either in a residue buried in the monomeric structure (A174D, which does not fold at all, as a result of burying a charged sidechain), or in the subunit interface, which causes the protein to dissociate into monomers (N334K, L256P, and A149P).

* Rellos, P., Sygusch, J., & Cox, T.M. (2000). Expression, purification and characterization of natural mutants of human aldolase B. J. Biol. Chem. 275, 1145–1151.
[Schematic figure: regions labelled ‘Similar sequences’, ‘Similar structures’, and ‘Similar functions’.]
When a protein evolves to change its function, many of these constraints are released – or, more precisely, replaced by alternative constraints required by the new function. The relationship between sequence and function is much more complex than the relationship between sequence and structure. Small changes in sequence, during evolution, usually make only small changes in structure. Often they make only small changes in function also. But changes in function do not necessarily require large changes in sequence or structure – function can jump.
Indeed, a protein can change function without any sequence changes at all. For instance, in the duck, an active lactate dehydrogenase and an enolase serve as crystallins in the eye lens, although they do not encounter the substrates in situ. In other birds, crystallins are closely related to enzymes, but some divergence has already occurred, with loss of catalytic activity. (This proves that the enzymatic activity is not necessary in the eye lens.) Many other such examples are known of ‘recruitment’, or ‘moonlighting’, by proteins – adapting to a novel function with relatively little sequence change.
Conversely, proteins with very different sequences and structures can have the same function. For instance, many families of proteinases differ in sequence and structure, sharing only a common general catalytic activity. Figure 5.13 summarizes, in a schematic way, the landscape of protein space with respect to the relations among sequence, structure, and function.
All three features of proteins – sequence, structure, and function – are potentially useful in interpreting new genome sequences. We expect that many regions of the new genome encode proteins similar to relatives known in other species. We can find them by looking for similar patterns in the sequences. We can expect that the structures will be similar, and indeed can calibrate the expected difference in structure from the extent of divergence in sequence. As Figure 5.13 shows, however, we cannot be as confident in assuming that function will be conserved.
Phylogeny 181
Phylogeny
Once we have measured the similarity of one or more properties of a set of individuals, or species, we can try to arrange them according to their apparent pattern of divergence. If in fact the similarities do arise during descent from a common ancestor, it should be possible to depict the relationships in a family tree (for individuals), or phylogenetic tree (for species and higher taxa). The goal is to present the pattern of similarities and divergences in a consistent diagram such that close relationships within the diagram correspond to high degrees of similarity.
The assumption is that such a diagram will have the form of a tree. (We saw an example of a tree in Figure 4.1.) The computations that derive the optimal phylogenetic tree from a matrix of similarities are not trivial, and the problem has been a challenge in research for some time.
The goal of phylogeny is a logical arrangement of a set of species, populations, individuals, and genes. The arrangement is derived from observed similarities. The basic principle is that the origin of similarity is common ancestry. Although there are many exceptions, arising from convergent evolution or horizontal gene transfer, this basic principle is crucial both for rationalizing contemporary observations and for opening a window onto the history of life.
From phylogeny, we infer relationships – among species, populations, individuals, or genes. Relationship is taken in the literal sense of kinship or genealogy, i.e. assignment of a scheme of ancestors and descendants (see Box 5.5).

• A phylogenetic tree is a diagram showing ancestor–descendant relationships, that captures a pattern of similarities, in that individuals or species closely linked in the tree have high similarity.
• Homology means, specifically, descent from a common ancestor.
• Similarity is the measurement of resemblance or difference, independent of the source of the resemblance. Similarity is observable now and involves no historical hypotheses. In contrast, assertions of homology require inferences about historical events, which are almost always unobservable.
• Similarity and dissimilarity. Data suitable for phylogenetic analysis may be specified equivalently in terms of similarities between objects or by dissimilarities. In comparing two DNA sequences, we may count the percentage of identical residues in an optimal alignment. This is a measure of similarity – the higher the value, the more similar the sequences. Alternatively, we could count the number of mutations separating the sequences. This is a measure of dissimilarity.
• Clustering is bringing together similar items, distinguishing classes made up of objects that are more similar to one another than they are to other objects outside the classes. Most people would agree about degrees of similarity, but clustering is more subjective. When classifying objects, some people prefer larger classes, tolerating wider variation; others prefer smaller, tighter classes. They are called, respectively, groupers and splitters.
• Hierarchical clustering is the formation of clusters of clusters of . . .
• The distinction between clustering and classification. Clustering is the determination of the set of classes into which a group of samples should be divided. Classification is the assignment of a sample to its proper place in a known set of classes.
• Phylogeny is the description of biological ancestor–descendant relationships, usually expressed as a tree. A statement of phylogeny among objects assumes homology and depends on classification.
In computer science, a tree is a particular kind of graph. A graph is a structure containing nodes (abstract points) connected by edges (represented as lines between the points). A path between two nodes in a graph is a series of consecutive edges that begins at one node and ends in the other. In a general graph, there may be many paths between any two nodes. (In Chapters 11 and 12, we discuss graphs in more detail.)
A tree is a special kind of graph. First of all, a tree must be connected, meaning that there is a path through the graph between any two points. Second, in a tree there is only one path between every two points. We have already seen several trees; for example, Figure 4.1.
A particular node may be selected as a root of a tree. However, this is not necessary – abstract trees may be rooted (for instance, Figure 5.14) or unrooted (for instance, Figure 5.15). In phylogenetic trees, the root is the earliest common ancestor of all of the other nodes. Rooted phylogenetic trees explicitly show ancestor–descendant relationships, as from any node there is a connected path up through successive ancestors terminating at the root. Unrooted trees show the topology of relationship but not the pattern of descent.
It may be possible to assign numbers to the edges of a graph to indicate some kind of ‘length’ of the edges, corresponding to a ‘distance’ between the nodes that the edges connect. These lengths are not necessarily geometric distances, but may be abstract values. Given edge lengths, the graph may be drawn to scale, with the sizes of the edges proportional to the assigned lengths.
In phylogenetic trees, edge lengths signify either some measure of the dissimilarity between two taxa, or the length of time since their separation. The assumption that differences between properties of living species reflect their divergence times will be true only if the rates of divergence are the same in all branches of the tree. Many exceptions are known. For instance, among mammals, many proteins from rodents show relatively fast evolutionary rates.
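The two defining properties of a tree (connected, and exactly one path between any two nodes) can be checked mechanically. The sketch below uses the standard fact that a connected graph on n nodes has unique paths exactly when it has n − 1 edges. It also walks from a node up to the root of a rooted tree. The function names and node labels are hypothetical.

```python
from collections import defaultdict, deque


def is_tree(nodes, edges) -> bool:
    """True if the undirected graph is connected with a unique path between nodes."""
    nodes = list(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False                     # a tree on n nodes has exactly n - 1 edges
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen = {nodes[0]}                    # breadth-first search from any node
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)       # connected if every node was reached


def path_to_root(parent: dict, node) -> list:
    """In a rooted tree, follow successive ancestors until the root is reached."""
    path = [node]
    while node in parent:                # the root is the only node with no parent
        node = parent[node]
        path.append(node)
    return path


print(is_tree("ABCD", [("A", "B"), ("B", "C"), ("B", "D")]))   # True
print(is_tree("ABCD", [("A", "B"), ("B", "C"), ("C", "A")]))   # False: cycle, D isolated

parent = {"A": "X", "B": "X", "X": "root", "C": "root"}
print(path_to_root(parent, "A"))   # ['A', 'X', 'root']
```

Storing each node's parent is enough to represent a rooted tree; an unrooted tree needs only the undirected edge list, which is why it shows topology but not the pattern of descent.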
Figure 5.15 Unrooted tree of relationships among finches from the Galapagos and Cocos Islands. Darwin studied the Galapagos finches
in 1835, noting the differences in the shapes of their beaks and the correlation of beak shape with diet. Finches that ate fruits had beaks
like those of parrots, whereas finches that ate insects had narrow, prying beaks. These observations were seminal to the development of
Darwin’s ideas. As early as 1839 he wrote, in The Voyage of the Beagle, ‘Seeing this gradation and diversity of structure in one small,
intimately related group of birds, one might really fancy that from an original paucity of birds in this archipelago, one species had been
taken and modified for different ends’.
The relationships among the finches of the Galapagos Islands, studied by Darwin, plus a related species from the nearby Cocos Island are shown in an unrooted tree (Figure 5.15). Addition of data from a species on the South American mainland ancestral to the island finches would allow us to root the tree.

ter similarities be relatively high and the intercluster similarities be relatively low. If the data are intrinsically well grouped, then the clustering is obvious. If there is no clear separation, proper clustering is ambiguous and difficult.
the difference between the average adult height of members of two species, or one could use the number of different bases in alignments of mitochondrial DNA.
To create a tree from the set of dissimilarities:
• First, choose the two most closely related species and insert a node to represent their common ancestor.
• Then replace the two selected species by a set containing both, and replace the distances from the pair to the others by the average of the distances

                {ATCC, ATGC}    TTCG              TCGG
{ATCC, ATGC}    0               ½(2 + 3) = 2.5    ½(4 + 3) = 3.5
TTCG                            0                 2
TCGG                                              0

The number 3.5 in the upper right was calculated by averaging the distances between ATCC and TCGG = 4 and between ATGC and TCGG = 3.
The next cluster is {TTCG, TCGG}, with distance 2. Finally, linking the clusters {ATCC, ATGC} and {TTCG, TCGG} gives the tree:
[Tree figure not reproduced; the two branches joining the clusters are each labelled 1.5.]
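The clustering of ATCC, ATGC, TTCG, and TCGG can be retraced with a short Python sketch. It is only an illustration under stated assumptions: the dissimilarity is taken to be the number of differing positions, the distance between two clusters is the average over all pairs of their members (which coincides with the averages worked out in the text for this example), and the function names `hamming` and `upgma` are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_distance(c1, c2):
    """Average dissimilarity over all pairs drawn from two clusters."""
    return sum(hamming(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def upgma(seqs):
    """Repeatedly merge the two closest clusters.

    Returns the merge steps as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(s,) for s in seqs]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average dissimilarity.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        # Replace the two selected clusters by a set containing both.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

steps = upgma(["ATCC", "ATGC", "TTCG", "TCGG"])
for a, b, d in steps:
    # Each merge places the new node halfway between the two clusters.
    print(a, b, "distance", d, "branch height", d / 2)
```

With these inputs the last merge joins {ATCC, ATGC} and {TTCG, TCGG} at distance 3, so each of the two joining branches has length 1.5.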
ATCA
A→G A→T
ATCG TTCA
C→G T→C
from which we would derive the incorrect phylogenetic tree:
[Tree figure with leaves A, B, C, D not reproduced.]
All of the methods discussed here are subject to errors of this kind if the rates of evolutionary change vary along different branches of the tree. To test for varying rates, compare the species under consideration with an outgroup – a species more distantly related to all of the species in question than any pair of them is to each other. For instance, if we are studying species of primates, a non-primate mammal such as the cow would be a suitable outgroup. If the rates of evolution among the primate species were constant, we would expect to observe approximately equal dissimilarity measures between all primate species and the cow. If this is not observed, the suggestion is that evolutionary rates have varied among the primates, and the character being used may well not provide the correct phylogenetic tree.

Bayesian methods
The problem we are trying to solve is:
Of all possible phylogenetic trees organizing the relationships among different species, based on a given multiple sequence alignment, which one has the highest probability of generating the observed multiple sequence alignment, under some model of evolutionary change? The model might be specified in terms of probabilities of mutation rates, etc., and for the moment it would seem that a weakness of the approach is the difficulty of knowing how to specify the model explicitly and accurately.
Nevertheless, from any such model of evolutionary change, we can compute the probability that any tree would produce the observed multiple sequence alignment. Suppose we begin the problem in a state of complete ignorance, meaning that we consider that initially – for all we know – all potential phylogenetic trees must be regarded as equally probable. Then Bayes’ rule states that we want to choose the tree with the highest probability of producing the observed multiple sequence alignment.
What makes this approach so powerful is that we can optimize the probability of producing the observed data not only over possible trees, but over different models of evolutionary change. This releases us from making overly constricting assumptions such as constancy of molecular clock rates over different branches of the tree, identical mutation probabilities at all sites, etc. The calculations are nevertheless feasible. There is consensus that programs based on the Bayesian approach are the most powerful tools for deriving phylogenetic trees from multiple sequence alignments.
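The role of the equal-probability assumption in the Bayesian argument can be made concrete with a toy calculation. The sketch below is not a phylogenetics program: the three tree labels and their likelihood values are invented for illustration, and `posterior` is a hypothetical helper that simply applies Bayes' rule over a finite list of candidate trees.

```python
def posterior(likelihoods, priors=None):
    """Apply Bayes' rule over a finite set of candidate trees.

    likelihoods: dict mapping tree label -> P(alignment | tree)
    priors:      dict mapping tree label -> P(tree); uniform if omitted
    """
    if priors is None:
        # "Complete ignorance": every candidate tree is equally probable.
        priors = {tree: 1.0 / len(likelihoods) for tree in likelihoods}
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())  # P(alignment), summed over all trees
    return {tree: joint[tree] / evidence for tree in joint}

# Hypothetical likelihoods for the three unrooted topologies of four species.
likelihoods = {
    "((A,B),(C,D))": 2.0e-12,
    "((A,C),(B,D))": 5.0e-13,
    "((A,D),(B,C))": 5.0e-13,
}

post = posterior(likelihoods)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Because the uniform prior is the same constant for every tree, it cancels in the ratio, and the tree with the highest posterior probability is the one with the highest probability of producing the observed alignment.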
Evolutionary divergence arises in nature through generation of variation by random mutation, followed by either selection or genetic drift to alter allele frequencies in populations or to create novel species. Contemporary techniques allow deliberate transfer of genes, to create organisms with altered characters directly. In addition to gene therapy for disease (see p. 125) – and genetically modified crop plants – many applications are available or under development:

Use of microorganisms as protein factories. Microorganisms are routinely used in the laboratory to express proteins – human or otherwise – for research. An example with clinical application is the microbial synthesis of human growth hormone. Formerly, the only source of the hormone was post-mortem extraction from pituitary glands. This carried the risk of transmitting prion disease. Other microbiologically produced human proteins with clinical applications include insulin and many monoclonal antibodies. Still other applications include manufacture of fuels, or plastics, or dissolving oil spills.
In the USA the attempt to patent genetically modified bacteria that could break down hydrocarbons was a landmark case, decided in favour of granting the patent by a 5–4 decision of the United States Supreme Court in 1980. Of course, novel varieties of flowers produced by classical methods of breeding and selection have been protectable for many years. The International Union for the Protection of New Varieties of Plants (UPOV) is an intergovernmental organization established by treaty in 1961. It is now, appropriately, turning its attention to biotechnology, and legal and intellectual property issues.

Genetically modified animals. Higher animals are also used as protein factories, in cases where the active protein requires post-translational modifications of which microorganisms are incapable. Production of drugs by this route is called ‘pharming’. Genetically engineered goats secrete an anticoagulant, human antithrombin III, in their milk. This product has been approved for clinical use in the United States.
Other goals of genetically modified animals include:
(a) enhancing the nutritional value of food; e.g. pork enriched in ω-3 fatty acids.
(b) pigs lacking the surface antigens that produce rejection by the human immune system, as a source of organs for transplant.
(c) animals that grow faster and/or require less expensive feed; for example fast-growing salmon.
(d) protecting livestock against disease; e.g. cows lacking prion proteins and therefore immune to BSE.
(e) allergen-free pets.
(f) fish that glow in colours by virtue of genes for fluorescent proteins.

Genetically modified plants. Many crop plants are targets for genetic modification. Goals include:
(a) pesticide-resistant plants, that allow treatments to kill weeds or insect pests without damaging the plants.
(b) a related approach is a plant that makes its own insecticide. Bt-corn (maize) contains a natural insect-killing gene transferred from Bacillus thuringiensis.
(c) crops with enhanced nutritional value. An example is ‘golden rice’ enriched in vitamin A (see p. 245).
(d) fruit with longer shelf life, such as the ‘flavr-savr’ tomato.
(e) crops that produce only sterile seeds.

There are a number of controversial aspects to these activities. In the case of genetically modified plants, there is concern over the spreading of genes from the crops to undesired hosts. For instance, if a gene for herbicide resistance is introduced into a crop plant, it would make it easier selectively to kill weeds without affecting the crop plant. However, it has been observed that the gene can spread to the weeds. Another concern is economic. Use of sterile seeds requires farmers to purchase new seeds each year. It precludes the traditional agricultural practice of holding back a portion of a crop for replanting.
In addition to the specific economic implications, there is a widespread feeling that biotechnology might alter the relationship between people and Nature that has been a common cultural heritage for thousands of years. It would be wrong to dismiss these feelings as irrational or as characterizing only a fringe.
● RECOMMENDED READING
Exercises
Exercise 5.1 On two photocopies of Figure 5.15, indicate a reasonable division of the species into
(a) three clusters; (b) five clusters.
Exercise 5.2 What is the Hamming distance between the words DECLENSION and RECREATION?
Exercise 5.3 What is the Levenshtein distance between the words BIOINFORMATICS and
CONFORMATION?
Exercise 5.4 The Levenshtein distance between the strings agtcc and cgctca is 3, consistent with
the following alignment:
ag-tcc
cgctca
Provide a sequence of three edit operations that convert agtcc to cgctca.
Exercise 5.5 To what alignment does the path through the following dot plot correspond?
[Dot plot not reproduced: THE RETORT COURTEOUS runs along the horizontal axis and THE REPLY CHURLISH along the vertical axis, with matching letters marked and a path indicated by arrows.]
Exercise 5.6 In the dot plot appearing in Figure 5.5, there is an interruption of the matching at a
height approximately at the level of the downward-pointing arrow at the left that precedes the
words Xenopus laevis. On a photocopy of Figure 5.5(b), indicate where in the sequence this
region appears.
Exercise 5.7 How would you use a dot plot to pick up palindromic DNA sequences of the type
that appear partly on each strand, as in the specificity sites of restriction endonucleases?
Exercise 5.8 According to the BLOSUM62 matrix: (a) is a histidine (H) more likely to change to
an asparagine (N) or to an aspartic acid (D)? (b) What is the ratio of the probability that a histidine
will be observed to change to an asparagine to the probability expected on the basis of the amino
acid composition of the protein that it will change to an asparagine?
Exercise 5.9 Consider the box (red outline) showing a part of a position-specific scoring matrix
in Figure 5.8. Suppose you were scoring a protein with 225 residues according to this matrix.
(a) How many columns would you expect there to be in the position-specific scoring matrix?
(b) How many rows would you expect there to be?
Exercise 5.10 On photocopies of Figure 5.13, indicate points (a) where a pair of highly diverged
homologous proteins with similar structure but without obvious sequence similarities might lie;
(b) where a pair of non-homologous proteins with similar structure might lie; (c) where a pair of
enzymes that share a function but not a structure (for instance, serine and cysteine proteinases)
might lie.
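The value quoted in Exercise 5.4 can be checked with the standard dynamic-programming calculation of edit distance. This is a minimal sketch, assuming unit cost for each substitution, insertion, and deletion; the function name `levenshtein` is illustrative only.

```python
def levenshtein(s, t):
    """Minimum number of single-character substitutions, insertions,
    and deletions needed to convert string s into string t."""
    # prev[j] holds the distance between the prefix of s processed so far
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # converting i characters of s into the empty string
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(levenshtein("agtcc", "cgctca"))
```

Only the final distance is computed here; recovering the edit operations themselves requires tracing back through the table, which is what the exercise asks for by hand.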
Problems
Problem 5.1 Draw a dot plot of the following sequence from the wheat dwarf virus genome:
ttttcgtgagtgcgcggaggctttt against itself. In what respects is it not a perfect palindrome?
Problem 5.2 How would you adapt the dot plot formalism to search for regions of DNA or RNA
that form local double-helical regions? Assume that the two hydrogen-bonded regions are
separated by only a short unpaired loop, as for example in tRNA (see Figure 1.4).
Problem 5.3 (a) How might the course of the BLAST calculation shown in Figure 5.7 differ if the
word length were chosen as 3 instead of 4? (b) How might the course of the BLAST calculation
shown in Figure 5.7 differ if the word length were chosen as 7 instead of 4?
Problem 5.4 The phylogenetic tree (p. 184) is derived from a complete dissimilarity matrix,
i.e. a specification of a measure of the dissimilarity between every pair of tetranucleotides. The
numbers associated with each edge reproduce the measures of dissimilarity between connected
nodes: i.e. the sum of the edges in the path between ATCC and ATGC is 0.5 + 0.5 = 1, which
is the value in the matrix corresponding to row ATCC and column ATGC. For every pair of
tetranucleotides, calculate the sum of the numbers associated with the edges in the path between
them. For which pairs do the results agree with the original dissimilarity matrix? For which pairs
do the results disagree?
Problem 5.5 Examples in the chapter derived a phylogenetic tree for the four sequences ATCC,
ATGC, TTCG, and TCGG by the UPGMA method (unweighted pair group method with arithmetic
mean) and a phylogenetic tree for the sequences ATCG, ATGG, TCCA, and TTCA by the
maximum-parsimony method. Derive phylogenetic trees for the sequences ATCC, ATGC, TTCG,
and TCGG by the maximum-parsimony method and for the sequences ATCG, ATGG, TCCA, and
TTCA by the UPGMA method. Show all intermediate steps. Compare the results with the trees
derived in the chapter.
Weblems
Weblem 5.1 Draw a picture of the human aldolase B monomer. Indicate the sites of the mutations
N334K, L256P, A149P, N334K, and A174D. Indicate the region of the active site and the regions
of intersubunit contacts. Comment on the possible severity of the effects of these mutations on
structure and function of the protein.
Weblem 5.2 Retrieve the globin sequences shown in Figure 1.17(b). Perform a multiple sequence
alignment, and draw the phylogenetic tree. Comment on ways in which the tree seems biologically
reasonable; and – if any – ways in which it does not.
CHAPTER 6
Genomes of Prokaryotes
LEARNING GOALS
• To know the features that distinguish the major divisions of life and to appreciate how
differences of lifestyle reflect differences in genomes and structures.
• To understand the molecular basis of adaptations; for example, to life at high temperatures, or
different ocean depths.
• To appreciate, at the molecular level, the genomic and phenotypic differences among selected
related species of prokaryotes.
• To face the problem of bacterial pathogenicity, and the development of antibiotic resistance.
• To recognize the vast variety of different microorganisms that inhabit, and mutually interact in,
environmental samples. These habitats include oceans and soils, and internal environments such
as the human (or animal) gut.
Prokaryotes have several claims on our interest.
• They cause infectious diseases. Some diseases, such as tuberculosis, are major public health problems. It is a challenge to control these diseases in the face of the development of antibiotic resistance.
• Molecular biologists study prokaryotes as examples of relatively simple cells, to understand fundamental principles of metabolism, genetics, and development.
• Historically, prokaryotes represent the earliest forms of life, from which all others are derived. They had the biosphere to themselves for over 2 billion years.
• Prokaryotes are important mediators of ecological processes and geological cycles. Indeed, geological and biological phenomena are linked in an intimate marriage, which has seen its turbulent episodes. Purely geological events such as asteroid impacts have caused mass extinctions. Purely biological events, such as the development of photosynthetic processes that released large quantities of O2 into the atmosphere, and respiration that released CO2, have altered general flows of matter and energy, affecting the development of the Earth’s geochemistry and climate. Microbes respond to human-caused environmental damage. They can aggravate ecological problems, but also hold out hope of ameliorating them (Table 6.1).

Table 6.1 Landmarks in history of life
Formation of Earth               ∼4.5 × 10^9 years ago
Origin of life                   >3.8 × 10^9 years ago
Cyanobacterial photosynthesis    >2.7 × 10^9 years ago
Rise of atmospheric O2           2.3–1 × 10^9 years ago
First metazoan                   ∼1 × 10^9 years ago
Cambrian                         ∼0.5 × 10^9 years ago

The major habitats of prokaryotes are the open ocean, surface soils, and subsurface sediments beneath both ocean and soil. The total carbon content of prokaryotes is between 60 and 100% of the total carbon found in plants, but the total nitrogen and phosphorus in prokaryotes is probably ten times that of plants. The bodies of humans and other animals harbour many microbes, but – important as the consequences for health and disease may be – as an overall reservoir we are a minor player.

• The oceans also contain viruses in very great abundance and variety. Most of these are uncharacterized. They have been called the ‘dark matter of the biosphere’. It is likely that viruses are an important mediator of gene transfer between marine prokaryotes.

The exploration of potential habitats by prokaryotes approaches saturation. Prokaryote cells divide actively. Production is estimated at 1.7 × 10^30 cells per year, the open ocean being the highest contributor. This fecundity gives prokaryotes the opportunity to evolve quickly. The resulting variety of prokaryotes includes the colonists of inhospitable habitats such as hot springs and very salty lakes. It also includes almost continuous local variations, adaptions to microniches (Table 6.2).

Major types of prokaryotes
C. Woese divided prokaryotes into archaea and bacteria, on the basis of 16S rRNA gene sequences. Figure 6.1 shows the secondary structures of a region within the 16S rRNA that differs in bacteria, archaea, and eukaryotes. In context, Figure 6.2 shows the tertiary structure of this region within the full Escherichia coli 16S rRNA structure in the ribosome.
Numerous other differences between archaea and bacteria have subsequently emerged, involving genomic, structural, and metabolic features:
• some genes in archaea but none in bacteria contain introns;
• there are systematic differences in tRNA sequences between archaea and bacteria;
• enzymes involved in DNA replication, such as DNA polymerases and some of the tRNA synthetases involved in protein synthesis, differ between archaea and bacteria;
[Table body not recovered. Column headings: Habitat; Number of prokaryotic cells (×10^28); Total carbon in prokaryotes (×10^15 g).]
From: Whitman, W.B., Coleman, D.C., & Wiebe, W.J. (1998). Prokaryotes: the unseen majority. Proc. Natl. Acad. Sci. USA 95, 6578–6583.
• archaea but not bacteria contain DNA-associated proteins resembling histones;
• membranes of all cells contain phospholipids: compounds combining a glycerol molecule with long-chain organic molecules (see Figure 6.3); however:
– bacteria and eukaryotes build cell membranes from phospholipids containing D-glycerol; archaea use L-glycerol
– the organic chains in bacteria and eukaryotes are fatty acids, typically 16–18 carbon atoms long, while archaea instead use polyisoprenes (the branching of the isoprene chains permits the formation of links between different phospholipids in archaeal membranes; this allows the membrane to develop a higher-order structure)
– bacteria and eukaryotes link the organic sidechains to glycerol with an ester linkage while archaea prefer an ether linkage;
• cell wall structures: bacterial but not archaeal cell walls contain peptidoglycan, a combination of sugar derivatives and peptides; and
• archaea and bacteria differ in their complement of metabolic pathways.

Do we know the root of the tree of life?
There is consensus that life on Earth began over 3.5 billion years ago. The earliest remaining evidence for cellular life has the form of microfossils, called stromatolites, from South Africa and Australia.
Archaea
The first archaea discovered lived at high temperatures near sea-floor hydrothermal vents or in lakes containing very high concentrations of salt, such as the Dead Sea. However, not all archaea are adapted to extreme environments. Indeed, there is some evidence that mesophilic archaea came first and that thermophiles were a later adaptation. Conversely, not all thermophiles are archaea; Thermus aquaticus (the source of Taq polymerase, an enzyme in common use for polymerase chain reaction amplification of DNA) is a bacterium.
Archaea are an abundant component of life in the open ocean, making up ∼20% of all marine microbes. They also associate with a variety of metazoan hosts.
The major groupings of archaea (Figure 6.4) are as follows.
• Crenarchaeota. Many but not all of these are thermophiles. They include Sulfolobus and Thermoproteus.
• Euryarchaeota. These include methanogens, sulphate reducers, and many extreme halophiles, thermophiles, and acidophiles, including:
– Halobacterium salinarum, which can grow in salt concentrations above 4 M! Many people find its photosynthetic abilities even more interesting: H. salinarum contains a bacteriorhodopsin with which it captures sunlight energy as ATP without involving chlorophyll.
– Picrophilus torridus, an extreme acidophile first isolated from the sulphurous volcanic springs of northern Japan. It can grow at pH 0.7!
Figure 6.4 Phylogenetic tree of archaea, based on analysis of 16S rRNA sequences. [Tree not reproduced. Labels appearing on the tree: Halobacterium, Halococcus, Natronococcus, Halophilic methanogen, Archaeoglobus, Methanobacterium, Methanocaldococcus, Methanothermus, Methanosarcina, Methanospirillum, Methanopyrus, Thermoplasma, Ferroplasma, Picrophilus, Thermococcus, Pyrococcus, Marine Euryarchaeota, Marine Crenarchaeota, Sulfolobus, Pyrodictium, Thermoproteus, Desulfurococcus, Korarchaeota.] Major archaeal groupings are coloured as follows (colour key not recoverable):
Euryarchaeota: Archaeoglobali, Halobacteria, Methanobacteria, Methanococci, Methanomicrobia, Methanopyri, Thermococci, Thermoplasmata
Crenarchaeota: Desulfurococcales
Korarchaeota
Nanoarchaeota (not shown)
BOX 6.1 Methanogens as sources of greenhouse gas emission: the case of New Zealand
New Zealand is home to 4 million people, 10 million cattle, and 45 million sheep. The sheep and cattle host methanogenic archaea in their stomachs to help to digest fodder.
In the USA and European countries, animals make only a relatively small contribution to greenhouse gas emissions. In contrast, in New Zealand, ruminant-produced methane accounts for approximately half of the country’s total greenhouse gas production. When New Zealand signed the Kyoto protocol, the government proposed to tax farmers to the tune of $NZ11 per ton of carbon emitted. (At that time $NZ1 ≈ UK£0.47 ≈ $US0.76.) This would amount to an annual charge of about $NZ0.09 per sheep and $NZ0.72 per cow.
The proposal met determined resistance from the pastoral community, some of it couched in surprisingly ribald terms. The New Zealand government ultimately abandoned the idea. Research into the effects of different fodders on internal flora and even antibiotics specifically targeting archaea is now under way.
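The figures quoted in Box 6.1 can be related to one another by simple arithmetic. The sketch below only rearranges the numbers given in the box; it assumes that 1 ton is 1000 kg and that the per-animal charges are annual, and the variable names are illustrative.

```python
TAX_PER_TON_CARBON = 11.0  # $NZ per ton of carbon, as quoted in the box

# Annual charge per animal ($NZ) and herd sizes, as quoted in the box.
CHARGE_PER_ANIMAL = {"sheep": 0.09, "cow": 0.72}
HERD_SIZE = {"sheep": 45_000_000, "cow": 10_000_000}

def implied_carbon_kg(charge, tax_per_ton=TAX_PER_TON_CARBON):
    """Carbon emission (kg per animal per year) implied by a given charge.

    Assumes 1 ton = 1000 kg.
    """
    return charge / tax_per_ton * 1000.0

implied = {animal: implied_carbon_kg(charge)
           for animal, charge in CHARGE_PER_ANIMAL.items()}
total_charge = sum(CHARGE_PER_ANIMAL[a] * HERD_SIZE[a] for a in HERD_SIZE)

for animal, kg in implied.items():
    print(f"{animal}: about {kg:.1f} kg carbon per year")
print(f"total annual charge: about $NZ{total_charge / 1e6:.2f} million")
```

On these numbers the proposed charge corresponds to roughly 8 kg of carbon per sheep and 65 kg per cow each year, and to a national total of a little over $NZ11 million per year.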
– Methanogens. These are strict anaerobes, dependent on the reaction:
CO2 + 4H2 → CH4 + 2H2O
Methanogenic archaea live in the guts of ruminant animals and help to digest cellulose. Cellulases hydrolyse plant fodder to simple sugars, from which CO2 and H2 are produced by fermentation. A cow can produce hundreds of litres of methane per day! (See Box 6.1.)
• Korarchaeota. These were discovered by environmental sampling of a hot spring in Yellowstone National Park, Wyoming, in the western USA. They are perhaps closest to the root of all archaea.
• Nanoarchaeota. These have been identified as a single small (∼400 nm diameter) hyperthermophile from a submarine hot vent.
The last two phyla are minor, at least in terms of our current knowledge of them.
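The methanogenesis reaction quoted for the methanogens, CO2 + 4H2 → CH4 + 2H2O, can be checked for atom balance with a few lines of Python. This is an illustrative sketch only: the helper names `atoms` and `side` are hypothetical, and the parser handles only simple formulas without parentheses or charges.

```python
import re
from collections import Counter

def atoms(formula, coefficient=1):
    """Count atoms in a simple formula such as 'CH4' or 'H2O'."""
    total = Counter()
    # Each match is an element symbol followed by an optional count.
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total[element] += coefficient * (int(n) if n else 1)
    return total

def side(terms):
    """Sum atom counts over (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in terms:
        total += atoms(formula, coefficient)
    return total

reactants = side([(1, "CO2"), (4, "H2")])
products = side([(1, "CH4"), (2, "H2O")])
print(dict(reactants), dict(products), reactants == products)
```

Both sides contain one carbon, two oxygen, and eight hydrogen atoms, so the equation is balanced as written.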
Figure 6.6 Proteins from thermophiles and hyperthermophiles are enriched in salt bridges relative to their mesophilic homologues.
Positively charged sidechains are shown in blue. Negatively charged sidechains are shown in red. (a) Subunit of glutamate
dehydrogenase from mesophilic bacterium Clostridium symbiosum [1HRD]. (b) Subunit of glutamate dehydrogenase from
hyperthermophilic archaeon Pyrococcus furiosus [1GTM]. (c) Sequence alignment of archaeal hyperthermophilic and mesophilic
glutamate dehydrogenase subunits.
Glutamate dehydrogenase
10 20 30 40 50 60
| | | | | |
Clostridium symbiosum SKYVDRVIAEVEKKYADEPEFVQTVEEVLSSLGPVVDAHPEYEEVALLERMVIPERVIEF
Pyrococcus furiosus ADPYEIVIKQLERAAQYMEISEEALEFLK___________________RPQRIVEV
V K EE L L P R E
– contain relatively fewer uncharged polar residues (Ser, Thr, Gln, Asn, Cys); some of these have thermolabile sidechains (His, Gln, Thr); and
– contain higher proportions of hydrophobic β-branched residues (see Box 6.3 and Figure 6.7).
Also, hyperthermophiles have special ‘chaperones’ – proteins that assist in the protein folding. It is likely that this is an adaptation to the challenge of high-temperature growth.
Proteins from thermophiles and hyperthermophiles are enriched in amino acids with β-branched sidechains. For example, compare leucine and isoleucine (see Figure 6.7). How does this help to achieve high-temperature stability?
Folding of a protein to a unique native state is a compromise. Attractive inter-residue interactions favour formation of a compact native state. However, the greater conformational freedom of the polypeptide chain in the denatured state favours unfolding. To stabilize the native state, the attractive interactions must ‘pay for’ the loss of conformational freedom.
Thermodynamically, this is expressed by the criterion for stability:
$$G_{\mathrm{Native}} - G_{\mathrm{Denatured}} = \Delta G = \Delta H - T\Delta S < 0$$
where ΔH is the enthalpy change, ΔS the entropy change, and T the absolute temperature.
• The enthalpy, H, represents the attractive inter-residue interactions. Attractive interactions lower the enthalpy and make H more negative. The enthalpy of the native state is lower than that of the denatured state:
$$\Delta H = H_{\mathrm{Native}} - H_{\mathrm{Denatured}} < 0 \quad \text{favours folding.}$$
• The entropy, S, represents the conformational freedom. In the denatured state, the protein molecules adopt many different possible conformations, whereas in the native state, the conformations of many degrees of freedom are fixed; thus, the entropy of the denatured state is higher than the entropy of the native state:
$$\Delta S = S_{\mathrm{Native}} - S_{\mathrm{Denatured}} < 0 \quad \text{favours unfolding.}$$
• Because systems (at constant temperature and pressure) come to an equilibrium state of minimum Gibbs free energy (G), a protein will form the native state if and only if a favourable ΔH overcomes an unfavourable ΔS:
$$\Delta G = \Delta H - T\Delta S < 0$$
Because the entropy term is weighted by T, it assumes relatively higher importance at higher temperatures.
[Structural formula of leucine, with sidechain carbons labelled α to δ, not reproduced.]
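The weighting of the entropy term by T can be illustrated numerically. The sketch below evaluates ΔG = ΔH − TΔS at several temperatures; the values of ΔH and ΔS are invented for illustration and are not taken from the text, and both are assumed to be independent of temperature, which real proteins only approximate.

```python
def delta_g(delta_h, delta_s, temperature):
    """Gibbs free energy change of folding, Delta G = Delta H - T * Delta S."""
    return delta_h - temperature * delta_s

def crossover_temperature(delta_h, delta_s):
    """Temperature at which Delta G = 0, i.e. T = Delta H / Delta S."""
    return delta_h / delta_s

# Hypothetical folding parameters (native minus denatured); both negative.
DELTA_H = -400.0e3  # J/mol, favours folding
DELTA_S = -1.2e3    # J/(mol K), favours unfolding

for temperature in (300.0, 330.0, 360.0):
    dg = delta_g(DELTA_H, DELTA_S, temperature)
    state = "native state favoured" if dg < 0 else "denatured state favoured"
    print(f"T = {temperature:.0f} K: Delta G = {dg / 1000:+.1f} kJ/mol ({state})")

print(f"Delta G changes sign near {crossover_temperature(DELTA_H, DELTA_S):.1f} K")
```

With these assumed values the native state is favoured at 300 K but not at 360 K. A smaller loss of conformational entropy on folding (a less negative ΔS) would move the crossover to a higher temperature.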
Table 6.3 Characteristics of T. kodakarensis, Pyrococcus abyssi, Pyrococcus horikoshii, and Pyrococcus furiosus
                    T. kodakarensis   P. abyssi    P. horikoshii   P. furiosus
Genome size (bp)    2 088 737         1 765 118    1 738 505       1 908 256
G+C (mole %)        52.0              44.7         41.9            40.8
Coding sequences    2 306             1 784        2 065           2 065
Figure 6.8 Diagram of the genome of T. kodakarensis strain KOD1. The contents of the consecutive circles, from the outermost, are:
(1) Scale in 0.2 Mb increments, plus the predicted origin of replication (oriC).
(2) Predicted protein-coding regions in clockwise direction.
(3) Predicted protein-coding regions in anticlockwise direction.
(4) Predicted tRNA coding regions in clockwise direction (red).
(5) Predicted tRNA coding regions in anticlockwise direction (blue).
(6) Predicted mobile elements in clockwise direction (red). Lines indicate transposase genes; boxes indicate virus-related regions.
(7) Predicted mobile elements in counter-clockwise direction (blue). Lines indicate transposase genes; boxes indicate virus-related regions.
(8) G+C content (mol. %) in 10 kb window.
Circles (2) and (3) are colour coded according to function:
From: Fukui, T., Atomi, H., Kanai, T., Matsumi, R., Fujiwara, S., & Imanaka, T. (2005). Complete genome sequence of the hyperthermophilic archaeon
Thermococcus kodakarensis KOD1 and comparison with Pyrococcus genomes. Genome Res. 15, 352–363.
Figure 6.9 Reconstructed scheme of metabolism and solute transport in Thermococcus kodakarensis. Components or pathways for which no predictable enzymes could be assigned appear
in red.
Each gene product with a predicted function in ion or solute transport is illustrated on the membrane. The transporters and permeases are grouped by substrate specificity, as cations
(violet), anions (green), carbohydrates/carboxylates/amino acids (yellow), and unknown (grey).
Metabolic pathways appear in the interior of the cell: (A) glycolysis (modified Embden–Meyerhof pathway); (B) pyruvate degradation; (C) amino acid degradation; (D) sulphur reduction;
and (E) hydrogen evolution and formation of proton-motive force, coupled with ATP generation.
Abbreviations: DHAP, dihydroxyacetone phosphate; E-4-P, erythrose 4-phosphate; F-1,6-BP, fructose 1,6-bisphosphate; G-3-P, glyceraldehyde 3-phosphate; G-6-P, glucose 6-phosphate;
OAA, oxaloacetate; 2OG, 2-oxoglutarate; PEP, phosphoenolpyruvate; 3-PG, 3-phosphoglycerate; PRPP, 5-phosphoribosyl 1-pyrophosphate; R-5-P, ribose 5-phosphate; Ribu-1,5-BP,
ribulose 1,5-bisphosphate; ACS, acetyl-CoA synthetase (ADP-forming); AlaAT, alanine aminotransferase; Fbp, fructose 1,6-bisphosphatase; FDH, formate dehydrogenase; FNOR,
ferredoxin:NADP oxidoreductase; Gdh, glutamate dehydrogenase; IOR, indolepyruvate:ferredoxin oxidoreductase; KOR, 2-oxoacid: ferredoxin oxidoreductase; PflDA, pyruvate formate
lyase and its activating enzyme; Oad, oxaloacetate decarboxylase; Pck, phosphoenolpyruvate carboxykinase; POR, pyruvate:ferredoxin oxidoreductase; Pps, phosphoenolpyruvate synthase;
Pyk, pyruvate kinase; VOR, 2-oxoisovalerate: ferredoxin oxidoreductase; AppABCDF, ABC-type dipeptide/oligopeptide transporter; CbiMOQ, ABC-type Co2+ transporter; CorA, Mg2+/Co2+
transporter; CysAT, ABC-type sulphate transporter; EriC, voltage-gated Cl− channel protein; FbpABC, ABC-type Fe3+ transporter; FeoAB, Fe2+ transporter; FepBCD, ABC-type Fe3+-
siderophore transporter; GltT, H+/glutamate symporter; Kch, Ca2+-gated K+ channel protein; MalEFGK, ABC-type maltodextrin transporter; MnhB-G, multisubunit Na+/H+ antiporter;
ModABC, ABC-type Mo2+ transporter; NapA and NhaC, Na+/H+ antiporter; NatAB, ABC-type Na+ efflux pump; PitA, Na+/phosphate symporter; PstABCS, ABC-type phosphate transporter;
PutP, Na+/proline symporter; SnatA, small neutral amino acid transporter; TrkAH, Trk-type K+ transporter; ZnuABC, ABC-type Mn2+/Zn2+ transporter; ZupT, heavy metal cation transporter.
From: Fukui et al. (2005) (see Figure 6.8).
Bacteria
Bacteria form the other division of prokaryotes (Figure 6.11). Bacteria have been known for much longer than archaea (A. van Leeuwenhoek discovered bacteria in 1676). In consequence, bacterial taxonomy bears a considerable load of historical baggage. It has required genomes to sort out the phylogenetic relationships. However, the genomes also show large amounts of horizontal gene transfer that preclude any neat solution in terms of a simple phylogenetic tree.
Figure 6.11 suggests one recent approach to classifying the main groups of bacteria. Box 6.4 gives some examples of better-known organisms in the different groups, with some brief comments. For a classification of bacteria focusing on pathogens, see http://www.microbialrosettastone.com/.
Group | Examples | Comments
Firmicutes | Bacilli, staphylococci, lactobacilli, Clostridia | Listeria and staphylococci can be infectious; Clostridia can cause food poisoning; lactobacilli are useful in yoghurt production
Actinobacteria | Micrococcus, Streptomyces | Decompose dead plant material; source of antibiotics
Fusobacteria | Fusobacterium nucleatum | Live in human gut, involved in periodontal infection
Thermotogae | Thermotoga subterranea | Thermophilic or hyperthermophilic; some are anaerobic
Thermus | Thermus aquaticus | Thermophilic; source of Taq polymerase
Deinococci | Deinococcus radiodurans | D. radiodurans is unusually radiation resistant
Chloroflexi | Chloroflexus aurantiacus | Photosynthetic, but do not produce O2; may provide clue to early development of photosynthesis
Cyanobacteria | Prochlorococcus marinus | Chlorophyll-based photosynthesis; most split H2O and produce O2; gave rise to chloroplasts via symbiosis
Spirochaetes | Leptospira, Borrelia burgdorferi, Treponema pallidum | Some are pathogenic (leptospirosis, syphilis, Lyme disease)
Fibrobacters | Fibrobacter intestinalis | Live in gut; help cattle to digest cellulose
Chlorobium | Chlorobium tepidum | Green sulphur bacteria; photosynthetic: oxidize sulphide to sulphur
Bacteroidetes | Bacteroides fragilis | Some are marine plankton; others are anaerobic, live in the gut and can cause infection. Porphyromonas gingivalis causes gum disease
Chlamydiae | Chlamydia trachomatis | Grow intracellularly; major cause of blindness; also cause sexually transmitted infections of the urogenital system
Aquificales | Aquifex aeolicus | Extremophiles, autotrophs
δ-Proteobacteria | Desulfovibrio desulfuricans | Mostly aerobic; some anaerobic examples reduce sulphur or sulphate
α-Proteobacteria | Rhodospirillum rubrum, Rhizobium, Rickettsia | Rhizobium are symbiotic with legumes and fix nitrogen; Rickettsia cause typhus; gave rise to mitochondria via symbiosis
β-Proteobacteria | Burkholderia, Bordetella, Thiobacillus, Neisseria | Some live on inorganic nutrients; others are infectious, different species causing pertussis, gonorrhoea and meningitis
γ-Proteobacteria | Escherichia coli, Haemophilus influenzae, Pseudomonas aeruginosa, Yersinia pestis, Salmonella typhimurium | Important in medicine and molecular biology: cause enteritis, typhoid, bubonic plague, and others
206 6 Genomes of Prokaryotes
that the genome of K12MG1655, 4 639 221 bp, is shorter than that of O157:H7, 5 528 445 bp. Regions amounting to 4.1 Mb are common to both strains. The unshared genes tend to cluster in strain-specific regions. Strain-specific regions in K12MG1655 contain 0.53 Mb, 528 genes, and in O157:H7 contain 1.34 Mb, 1387 genes.

It is likely that the strains diverged about 4.5 million years ago. A clue to the origin of the differences between the strains is the atypical base composition of the strain-specific regions. This suggests that they entered the respective genomes by horizontal gene transfer.

• Horizontal gene transfer is a common theme in development of virulence and antibiotic resistance.

Helicobacter pylori. Half the world’s population is infected with H. pylori. One in 10 of those infected develops clinical disease: gastritis, duodenal and gastric ulcers, and some cancers. Proof that H. pylori infection is the cause of ulcers was obtained – over the disbelief of the scientific and medical establishments at the time – by Barry Marshall, who swallowed a culture of H. pylori, and quickly developed symptoms of gastritis.

H. pylori strains are very diverse (they have been applied to tracking of patterns of human migration). Three strains have been sequenced completely. Strain 26695 contains about 1.7 Mbp, and about 1550 genes. Other sequenced strains differ by about 6%. Virulence appears to be associated with a common 40 kb Cag pathogenicity island containing more than 40 genes. The appearance of genes within this island is correlated with virulence. This pathogenicity island is common to many bacteria. It is likely that it has been circulated by horizontal gene transfer.

Staphylococcus aureus. S. aureus infections are a growing clinical problem because of the aggressive development of antibiotic resistance. (The development of resistance to vancomycin – the ‘antibiotic of last resort’ – is discussed in Chapter 9.)

The genomics of S. aureus has been pursued vigorously, in order to identify the mechanisms of development and spread of resistance. The S. aureus genome is about 2.8–2.9 Mb long. Assignment of approximately 2600 open reading frames accounts for almost 85% of the genome. There is a single plasmid containing about 25 000 bp. Genes for enhanced antibiotic resistance are encoded by a transposon inserted into the plasmid. Comparison of the sequences – in particular, observation of lack of synteny – has made it clear that the development of methicillin resistance was not a single event, producing a clone that was subsequently selected. Instead, the resistance elements were acquired many times by many strains, via horizontal gene transfer.

A comparison of the sequences of many S. aureus strains, encompassing different clinical phenotypes, showed that 78% of genes were common to all strains, including isolates from cow and sheep. The remaining 22%, which are at least partially strain-specific, tend to be localized within 18 large regions of difference (RDs), ranging from 3 to 50 kb long. Figure 6.12 shows the presence of these regions in the different strains, and the correlation of the pattern with methicillin resistance.

Genomics and the development of vaccines

Genomics and recombinant technology have made possible a new generation of approaches to vaccine design.

A vaccine against hepatitis B virus is expressed in yeast cells. It is a surface antigen, a viral envelope protein, the gene for which was cloned into yeast. A vaccine against Bordetella pertussis (the causative agent of whooping cough) is based on the toxin, a multi-subunit protein. By genetic engineering, the molecule was completely detoxified by introduction of mutations, which removed the enzymatic activity but left the immunological properties intact. That is, antibodies raised against the detoxified form protect against the native protein.

A more general approach to vaccine design involves comparing the genome sequences of pathogenic and nonpathogenic strains to identify virulence factors that might serve as the basis of vaccines.

Neisseria meningitidis serogroup B is the major cause of meningitis and septicaemia in children and young adults. From the 2 272 351 bp genome sequence, computational methods predicted 2158 genes. Algorithms predicted that 600 of them would be on the cell surface or secreted. These were candidates for vaccines. Of these, 350 were expressed in E. coli, and tested in mice for an immune response that produced
Bacteria 207
[Figure 6.12 image: grid of S. aureus isolates (labelled MSA3410 through MSA537, plus RF122 and COL) against regions of difference RD1–RD18.]
Figure 6.12 Results of comparison of sequences of 36 isolates of S. aureus. RD = regions of difference, localized segments of the
genome of high variability among strains. Filled squares indicate RD present; empty squares, RD absent. Hatched squares correspond
to methicillin-resistant strains. Red indicates isolates of electrophoretic type 234, the predominant type causing toxic shock syndrome.
From: Fitzgerald, J.R., Sturdevant, D.E., Mackie, S.M., Gill, S.R., & Musser, J.M. (2001). Evolutionary genomics of Staphylococcus aureus: Insights into
the origin of methicillin-resistant strains and the toxic shock syndrome epidemic. Proc. Nat. Acad. Sci. USA 98, 8821–8826. Reproduced by permission.
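
The atypical base composition cited earlier in this chapter as a clue to horizontal gene transfer can be screened for computationally. What follows is a minimal sketch in Python, not a published method: it computes G+C content in fixed, non-overlapping windows and flags windows that deviate from the genome-wide value by more than a chosen threshold. The function names, window size, threshold, and toy sequence are all illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window=10, threshold=0.25):
    """Return (start, gc) for non-overlapping windows whose G+C content
    differs from the genome-wide value by more than `threshold`.

    `window` and `threshold` are illustrative values, not recommendations.
    """
    overall = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, window):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - overall) > threshold:
            flagged.append((start, gc))
    return flagged


# Toy sequence (hypothetical): an AT-rich background with a GC-rich insert.
toy = "ATATTAATTA" * 4 + "GCGGCCGCGC" + "ATTATAATAT" * 4
for start, gc in atypical_windows(toy):
    print(f"window at {start}: G+C = {gc:.2f}")
```

On real genomes, windows of several kilobases would be more appropriate, and a deviation in base composition is only a hint: it does not by itself prove that a region was acquired by horizontal transfer.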
The development of resistant strains of pathogens presents a severe challenge to medicine. There is consensus that novel antibiotics will be needed. Some are already in the ‘pipeline’, currently in the clinical testing phase. However, the research that produced today’s ‘new’ antibiotics was initiated in the early 1990s and pharmaceutical companies are reducing their emphasis on antibiotic research. It is likely that fewer new antibiotics will emerge in the current industrial climate. The paradox of a growing need for new discoveries coupled with the reduction in resources aimed at generating them creates a problem that may become a crisis.

The novel developments associated with genomics, the subject of this book, can in principle contribute to the development of antibiotics. It is possible to identify targets – specific proteins, essential for a pathogen, that differ from mammalian proteins sufficiently to suggest that drugs against these proteins would be effective against the pathogen but non-toxic to mammals. Unlike classical antibiotic research practice, experimental methods are now available that can define the mechanism of action of a drug while it is still under development.

A related prospect is to turn to biology as well as chemistry to discover new therapeutic agents, including revival of an old suggestion of using bacteriophages clinically. A number of small biotech companies have started up, funded by venture capital, to try to explore a number of non-traditional avenues. Given the decade ‘lead time’ required for a new drug to make its way from the laboratory to approval in clinical use, the process must be set in motion immediately. A problem with this approach is a line of recent US court decisions that impose stricter criteria on the patentability of procedures that might be considered ‘natural processes’.

The problems are multidimensional, involving science, economics, long-term forecasting, and regulatory and patent law. Each field presents an individual set of difficulties. These difficulties are compounded by the necessity to solve them all simultaneously in the face of both genuine conflicts between different goals, and boundaries between professions that impede communication and cooperation. There is, however, consensus that the problems must be solved.
Classically, microbiologists studied prokaryotes by growing them in culture, isolating pure strains for detailed study. Powerful as the methods were, and useful as they were for clinical applications and research, they were also blinders that prevented full appreciation of the variety and interactions of species in natural environments. DNA sequencing has made it possible to:

• clarify evolutionary relationships;
• use high-throughput sequencing methods to study a cross-section of the life in a natural sample;
• study the majority of strains that are difficult to grow in culture; and
• appreciate the relationships and interactions among different species that share an ecosystem.

• A millilitre of ocean water may contain 100–200 species. A gram of soil may contain 4000.

From natural samples containing complex mixtures, it is possible to amplify and determine sequences directly, without culturing individual strains. The molecule of choice has been 16S rRNA. This is partly because of its traditional role as a molecule that varies at the appropriate rate to distinguish ancient phylogenetic branching patterns. In addition, rRNA is not very prone to horizontal gene transfer. It thereby preserves the distinctions between taxa – perhaps, however, this disguises the mixing that has taken place with other genes. Another disadvantage of characterizing an organism by its rRNA is that rRNA does not reveal any details of the metabolism or other adaptations of the species.

An example of metagenomics is the sequencing of 16S rRNA genes from ocean water from the Sargasso Sea. A group led by J.C. Venter sequenced 109 non-redundant regions. Many novel sequences were found, although it is difficult to assemble complete genomes and avoid chimaeras.

Marine cyanobacteria – an in-depth study

The basic concepts of genome evolution are secure: organisms explore genome variations. Adaptations drive some changes – via divergence, allelic redistribution within a population, gene loss, or gene acquisition by horizontal gene transfer. Neutral genetic drift accounts for other changes, especially in small populations.

What is more difficult to understand is how populations make choices and adopt strategies. If organisms encounter environments varying in space and/or time, into how many populations, or even new species, will they split? Which proteins will diverge – in sequence, in function, or in expression pattern? What novel genes are needed and where will they come from?

In most field situations, the ‘topology’ of evolutionary space depends on a complicated interaction of physical and ecological variables, such as geographic barriers imposed by landscape, climate, and interspecies cooperation or competition. Complex environments give rise to complex biological communities.

The distribution of cyanobacteria in the open oceans in temperate regions offers a relatively simple ecological context. The distributions – both of environmental features and of species or strains – depend on a single variable, depth. There are many correlates of depth: light intensity and quality, temperature, pressure, ultraviolet light penetration, nutrient availability (notably, sources of nitrogen and iron), and the occurrence of predators and viruses. Yet, for all its complexity, the system is one-dimensional.
Metagenomics: the collection of genomes in a coherent environmental sample 209
How are different Prochlorococcus strains adapted to differences in ambient light intensities and spectral distributions?

Effective interactions with light require both efficient energy transduction and protection from photochemical damage caused by excitation energy spillover.

• Antenna complexes of photosystem II (PSII) of most cyanobacteria, including Synechococcus, contain protein complexes called phycobilisomes. Prochlorococcus is unusual among cyanobacteria in using, as PSII antennae, proteins binding unusual modified (divinyl) chlorophylls, called Pcb proteins.

Where did the Pcb proteins come from? They appear to have been recruited from a family called

Ultraviolet light can damage DNA. Major products include thymine dimers, the linkage of adjacent thymine residues in DNA. In response to the threat of mutation, cells contain repair enzymes, including photolyase, which recovers thymines from dimers. Because ultraviolet light does not penetrate far into sea water, Prochlorococcus strain MED4, living nearer the surface, is in greater danger from photochemical damage than MIT9313. Indeed, MED4, but not MIT9313, contains a gene for photolyase. Another difference between the strains is also probably related to photo-oxidative stress: MED4 contains perhaps twice as many high-light-inducible proteins as MIT9313. From their distribution in the genome, some of these appear to have arisen by recent duplication events.
* See: http://www.es.flinders.edu.au/~mattom/IntroOc/notes/figures/fig5a5.html.
● RECOMMENDED READING
Exercises
Exercise 6.1 Which of the differences between archaea and bacteria, described on pp. 192–194,
could you derive from genome sequence alone?
Exercise 6.2 Using the standard genetic code (Box 1.1), are there any amino acids for which a
hyperthermophilic organism could not preferentially choose a codon with G or C in the third
position?
Exercise 6.3 From the table in Box 6.4, name two groups of bacteria that (a) are photosynthetic,
(b) live in the human gut, (c) are pathogenic.
Exercise 6.4 On a photocopy of Figure 6.6, circle salt bridges that appear at (approximately)
common positions in both structures.
Exercises, problems, and weblems 213
Exercise 6.5 On a photocopy of Figure 6.7, identify positions at which there is a charged residue
in the P. furiosus sequence but an uncharged residue in the C. symbiosum sequence. (Charged
residues = K, R (shown in blue), D, E, H (shown in red); uncharged residues = all the others.)
Exercise 6.6 On a photocopy of Figure 6.9, (a) circle the glycolysis/gluconeogenesis pathway;
(b) circle the Calvin cycle.
Exercise 6.7 On a photocopy of Figure 6.10, circle a region in which synteny is maintained in
P. abyssi, P. horikoshii, and P. furiosus.
Exercise 6.8 Why is the recombinant vaccine against hepatitis B expressed in yeast and not
E. coli?
Exercise 6.9 If 1 ml of seawater contains 200 species, how big would the metagenome be?
Problems
Problem 6.1 Draw a Venn diagram showing the numbers of genes specific to one, common to
pairs, and common to all three of the following: Prochlorococcus strain MED4, Prochlorococcus
strain MIT9313, and Synechococcus strain WH8102.
Problem 6.2 In Figure 6.12, hatched squares indicate MRSA strains. (a) Find two RDs (regions
of difference) that are present in all methicillin-resistant strains studied. (b) Find two RDs that are
present in some methicillin-resistant strains, but not every strain containing either of them is
methicillin resistant. (c) Is there any RD that is present in every methicillin-resistant strain, and
absent from every methicillin-sensitive strain?
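
Questions of the kind posed in Problem 6.2 reduce to set operations on a presence/absence table. The sketch below uses Python sets to show the idea. The strain names, RD assignments, and resistance labels are invented for illustration and are not the data of Figure 6.12; the helper functions `in_all` and `in_any` are likewise hypothetical.

```python
# Hypothetical presence/absence data: strain -> set of RDs present.
# These values are invented for illustration; they are not read from Figure 6.12.
rd_present = {
    "strainA": {"RD1", "RD2", "RD5"},
    "strainB": {"RD1", "RD2"},
    "strainC": {"RD2", "RD3"},
    "strainD": {"RD3", "RD5"},
}
resistant = {"strainA", "strainB"}      # hypothetical methicillin-resistant strains
sensitive = set(rd_present) - resistant


def in_all(strains):
    """RDs present in every strain of the group."""
    return set.intersection(*(rd_present[s] for s in strains))


def in_any(strains):
    """RDs present in at least one strain of the group."""
    return set.union(*(rd_present[s] for s in strains))


# (a) RDs present in all resistant strains
shared_by_resistant = in_all(resistant)
# (b) RDs found in some resistant strains and also in some sensitive strain
not_diagnostic = in_any(resistant) & in_any(sensitive)
# (c) RDs in every resistant strain and in no sensitive strain
diagnostic = shared_by_resistant - in_any(sensitive)

print(sorted(shared_by_resistant), sorted(not_diagnostic), sorted(diagnostic))
```

Replacing the toy dictionary with the pattern read off the figure would answer each part of the problem directly.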
Weblems
Weblem 6.1 Print out the complete secondary structures of 16S rRNAs from E. coli,
Methanococcus vannielii, and Saccharomyces cerevisiae (http://www.rna.icmb.utexas.edu/).
On each structure, indicate where the region illustrated in Figure 6.1 appears.
Weblem 6.2 Are any archaea implicated in human disease?
Weblem 6.3 (a) Identify a bacterium with a growth temperature below 10°C. (b) Identify an
archaeon with a growth temperature below 10°C. (See Figure 6.5.)
Weblem 6.4 Where did Prochlorococcus get the chlorophyll-binding proteins of its photosystem II
antenna? (a) Using either of the Prochlorococcus Pcb proteins (UniProt ID PCBA_PROMM or
PCBA_PROMP), search using PSI-BLAST for homologous cyanobacterial proteins with different
functions. You should find – among others – ISIA_SYNP6, ISIA_SYNP7, and ISIA_SYNP2. Present
the results of this search, editing the output down to the most relevant information. (b) What
is the function of the ISI family of proteins? (c) Align the full sequences of PCBA_PROMM,
PCBA_PROMP, ISIA_SYNP6, ISIA_SYNP7, and ISIA_SYNP2, using CLUSTAL W or T-Coffee.
Comment on the extent of the divergence. (d) Determine the fraction of identical residues
between every pair of sequences from the multiple sequence alignment and draw a
phylogenetic tree.
Weblem 6.5 The high-light-intensity-adapted Prochlorococcus strain MED4 contains a
phycoerythrin, but the low-light-intensity-adapted strain MIT9313 does not. Do most
Synechococcus strains contain phycoerythrins? If so, align the sequences of a prochlorococcal
and synechococcal phycoerythrin and comment on the extent of the divergence compared
with the divergence of prochlorococcal Pcb proteins from synechococcal ISI proteins (see
Weblem 6.4).
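
The pairwise comparison asked for in Weblem 6.4(d) can be sketched in a few lines of Python once a multiple sequence alignment is in hand. This is a minimal illustration under stated assumptions: the three aligned sequences are toy strings, not real Pcb or IsiA proteins, and the choice to skip columns containing a gap is one common convention among several.

```python
from itertools import combinations


def fraction_identical(a, b):
    """Fraction of identical residues between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped; this is one
    common convention, assumed here, and other denominators are also in use.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


# Toy alignment (hypothetical sequences, not real Pcb or IsiA proteins).
alignment = {
    "seq1": "MKT-LLVA",
    "seq2": "MKTALLVA",
    "seq3": "MRT-LIVG",
}

identity = {
    (p, q): fraction_identical(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}
for (p, q), value in identity.items():
    print(f"{p} vs {q}: {value:.2f}")
```

A simple distance for drawing a tree could then be taken as one minus the fraction identical, although corrected distances are generally preferred for highly diverged sequences.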
CHAPTER 7
Genomes of Eukaryotes
LEARNING GOALS
• Having a clear sense of how eukaryotic cells differ from prokaryotic cells.
• Understanding the relationships among the major types of eukaryotes.
• Recognizing that fungi in general and yeasts in particular are among the simplest eukaryotes,
and have served as model organisms in the molecular biology laboratory (in addition to
applications in agriculture and cooking).
• Knowing the unique features of higher plants, and the focus on Arabidopsis thaliana – ‘the fruit
fly of botany’.
• For the animals, appreciating the evolutionary path from early organisms, more distantly related
to humans, to mammals.
• Respecting the power – albeit limited – of the ability to recover and sequence DNA from extinct
organisms.
We, like all other species that exist or have existed, are the product of a long evolutionary history.
Genomes give us snapshots of landmarks along the route. They allow us to understand the
topology of the path – where and when it branched. They reveal when certain features arose. In
some cases they show us the experiments – some productive, some abortive – that preceded the
mature result subsequently adopted.
There is consensus that eukaryotes are descended from prokaryotes. Evidence includes the observations that both prokaryotes and eukaryotes use the same genetic code, and share many metabolic pathways.

When did eukaryotes originate? There is consensus that prokaryotes had the world to themselves for many years since the origin of life (dated at no later than 3.5 billion years ago). The first eukaryotic fossil is approximately 2 billion years old. The discovery in datable oil deposits of biochemicals produced only by eukaryotes suggests an earlier origin, almost 3 billion years ago.

The first eukaryotes began as unicellular organisms. Multicellularity began with formation of colonies, followed by cell specialization, perhaps first by simple symbiosis. Developmental programmes allowed specialization within clonal clusters.

There followed exploration of a great variety of body plans, based on a variety of tissue types. Major landmarks that we know about in the development of higher organisms include the division between animals and plants. Some animals adopted body structures with bilateral symmetry. Some of these became vertebrates. Eventually, some became human.

Until relatively recently, our view of the history of life was limited to organisms that have left either descendants or fossils. The Burgess Shale deposits show us that we have missed a lot of interesting alternatives. It is true that it has become possible to recover and sequence ancient DNA. This has opened a window on extinct species. But only in a limited way.

Our only real possibility of reconstructing our history is through the genomes of extant organisms that have been around for a long time, and to read the story that their sequences tell.

• In order to reconstruct evolutionary history from genomes, we must analyse genomes from species, the ancestors of which first arose in the distant past. What do they share with their close relatives, that might offer genomic characterization of a group of species; for instance, what are the defining genomic features of vertebrates? What do ancient species share with their predecessors and successors? What innovations did they achieve? Which of those were dead ends and which did other species descended from them adopt and develop?
Genome sequences provide detailed information about evolutionary relationships among species. So many genomes are now known that we must pick and choose only a few of the interesting examples. The Genomes On Line Database (GOLD) lists 2007 completed or ongoing eukaryotic genome sequencing projects.

Figure 7.1 shows the major groups of eukaryotes. Note the ‘star’ topology – at this level of low resolution, eukaryotes are a bush rather than a tree. In this chapter, however, we shall follow a more directed path, roughly in the direction towards higher complexity. This correlates fairly well with date of origin.

From the comparative genomics of eukaryotes, we can ask questions about features of humans that we consider essential. For instance, we can look into the origin of some features of a body plan, or the immune system, or the endocrine system, or features of the nervous system. Phrasing the questions loosely: Who invented them? When? What if anything did they come from? What alternatives were experimented with and how well did they work?

This is an ambitious programme. It is appropriate to begin with one of the simplest eukaryotes, yeast.

The yeast genome

Yeast, like E. coli, is an organism better known to many of us from molecular biology labs than from nature. It has served as a model eukaryote, because of its relative simplicity, ease of growth, having both haploid and diploid states, and safety. Of course we
Evolution and phylogenetic relationships in eukaryotes 217
[Figure 7.1 image: unrooted ‘star’ tree with branches labelled Plants (land plants, many algae); Chromalveolates; Alveolates (dinoflagellates); Stramenopiles (diatoms, brown algae); Rhizaria (foraminifera); Amoebozoa (amoebas); Discicristates (trypanosomes); Excavates (diplomonads); Opisthokonts (sponges, fungi, animals). Two asterisks are marked on the tree.]
Figure 7.1 Major classes of eukaryote, with example species in parentheses. The asterisks mark possible positions of the root of the tree.
From: Baldauf, S.L. (2003). The deep roots of eukaryotes. Science 300, 1703–1706, and personal communication.

coding RNAs. A single tandem array on chromosome XII encodes 120 copies of ribosomal RNA. There are 40 genes for small nuclear RNAs, and 275 genes for transfer RNA, about one-third of which contain introns.

Of the protein-coding genes, 4777 correspond to molecules to which a function can be assigned. About 1000 more contain some similarity to known proteins in other species. Another ∼800 are similar to ORFs in other genomes that correspond to unknown proteins. Many of these homologues appear in prokaryotes. Only ∼1/3 of yeast proteins have identifiable homologues in the human genome.

The classification of yeast protein functions shown in Table 7.1 is taken from the Saccharomyces genome database, http://mips.gsf.de/genre/proj/yeast/Search/Catalogs/catalog.jsp.

[Figure 7.2 image: time-scaled tree of land plants. Groups: Bryophytes (mosses), Ferns, Ginkgos, Conifers, Cycads, and Angiosperms (flowering plants: Eudicots, Monocots). Time axis: Palaeozoic (Ordovician, Silurian, Devonian, Carboniferous, Permian), Mesozoic (Triassic, Jurassic, Cretaceous), Cenozoic (Tertiary, Quaternary).]
Figure 7.2 Phylogeny of land plants. The picture is limited to groups with extant examples with which readers may be familiar. Monocots and eudicots are named for the difference between single and double cotyledons in the embryo, but many other features separate them.

The evolution of plants

Plants and animals parted company a long time ago. Although all life forms share much of their molecular biology, plants derive energy from sunlight via photosynthesis. The consequences for their structure, lifestyle, and developmental programmes have been profound. They require many proteins dedicated to their unique biophysical and metabolic activities. Plant genome sequences illuminate the similarities to and differences from other eukaryotes:

• Plants share some functions with animals. At the genomic and proteomic level, are they achieved in similar ways?
• Some functions are unique to plants. At the genomic and proteomic level, where did they come from? Were they invented, adapted, or borrowed?

From a common ancestor living about 800 million years ago, metazoa split into three major groups: fungi, animals, and plants. Higher plants evolved from single-celled organisms formerly classified as green algae. Green algae have now been split into streptophytes and chlorophytes; streptophytes are related to higher plants.

Plants came ashore to occupy land environments about 450 million years ago, in the mid-Ordovician. Most plants today are angiosperms, or plants with flowers (see Figure 7.2). Angiosperms arose about 140–190 million years ago.

The first complete nuclear genome of a higher plant to be sequenced was that of Arabidopsis thaliana, common name thale cress. It is related to turnip, cabbage, and broccoli. Its ease of handling and rapid generation time have made it a favoured subject for research in plant molecular biology. A. thaliana has been called ‘the fruit fly of botany’.

The Arabidopsis thaliana genome

A. thaliana has a relatively small genome – 146 Mb – distributed over five chromosomes (see Box 7.1). (The maize genome is almost 20 times as large.)

• The compact genome was one reason why the research community adopted Arabidopsis.
• Chordates have a notochord, a rudimentary cartilaginous skeleton running dorsally from head to tail. A nerve cord lies parallel, adjacent, and dorsal to the notochord. Vertebrates retain the notochord during early embryonic development (vertebrates are also chordates) but replace it with the spinal column.

Figure 7.4 Adult Ciona intestinalis.
From: Cañestro, C., Bassham, S., & Postlethwait, J.H. (2003). Seeing chordate evolution through the Ciona genome sequence. Genome Biol. 4, 208 (Photo by Andrew Martinez).

The C. intestinalis genome has approximately 160 Mbp, about 1/20 the size of the human genome. From the initial sequence determined, approximately 16 000 proteins were inferred. This is comparable to invertebrates, and lower than typical vertebrate repertoires. The genes are tightly packed (7.5 kb/gene, compared to 100 kb/gene for humans).

Because of the position of Ciona in the evolutionary tree, it is of interest to analyse its genes according to where homologues exist.

• Almost 60% of the genes have homologues in C. elegans and/or D. melanogaster. These represent
genes shared by species ancestral to both invertebrates and chordates.

• A few genes look more similar to genes from worm and/or fly than to vertebrate genes. It is likely that these are vestiges of the common ancestor lost in the lineage leading to vertebrates. An example is the gene for haemocyanin, the oxygen carrier in many invertebrates. Ciona also contains genes for four globins, the vertebrate oxygen carrier.

Some haemocyanins also have enzymatic activity, as phenoloxidases, enzymes which convert monophenols to diphenols, and/or diphenols to o-quinones. These reactions are involved in invertebrate immune responses: a defence reaction activates phenoloxidase activity, producing reactive quinones, which contribute to the inactivation of foreign organisms.

Indeed, one system carefully looked for in Ciona, but conspicuous by its absence, is an adaptive immune system. Ciona does not appear to have genes for immunoglobulins, T-cell receptors, or MHC proteins. This appears then to be a later invention, by vertebrates. The characteristic molecules are absent from even the primitive jawless vertebrates – lamprey and hagfish.

• A fifth of the genes have no apparent homologue in vertebrates or invertebrates. It is likely that homologues will be discovered – perhaps when the protein structures are determined – or they may be very highly diverged within the urochordate lineage.

Some Ciona proteins carry out functions specific to urochordates. The urochordate body is surrounded by a ‘tunic’, made of fibrous cellulose-like polysaccharide. (Tunicates is another name for this group.) Ciona contains enzymes for synthesis of cellulose, and endoglucanases, which degrade cellulose. Of course cellulose as a structural material is unusual in organisms other than plants and bacteria. There is evidence that the last common ancestor of urochordates acquired the cellulose synthase gene by lateral transfer from bacteria. Cellulose degradation is more widespread. The endoglucanases of Ciona are most similar to homologues in animals that digest cellulose, such as termites and some cockroaches.

The genome of the pufferfish (Tetraodon nigroviridis)

Tetraodon nigroviridis is a freshwater pufferfish. Ancestors of humans and fish parted company 450 million years ago. Comparing the genome of T. nigroviridis with other vertebrate genomes should therefore reveal defining properties of vertebrates.

The T. nigroviridis genome is about 340 Mb in length, an order of magnitude smaller than the human. Contributing to the compactness are a relative paucity of repetitive transposable elements, and shorter introns and intergenic regions. Forty per cent is protein coding! Approximately 28 000 protein-coding genes have been identified.

The T. nigroviridis genome strikingly illuminates the large-scale structure of vertebrate genomes. First, there is evidence for whole-genome duplication. Chromosome rearrangements have complicated what might originally have been a simple pattern. There remains, however, a considerable degree of synteny between pairs of groups of paralogous genes on different chromosomes. These common syntenic blocks within the T. nigroviridis genome arose by whole-genome duplication followed by chromosome rearrangement.

Second, it is possible to map syntenic groups between T. nigroviridis and human. Figure 7.5 shows reciprocal maps. Consider Figure 7.5a. Imagine each human chromosome coloured a constant separate colour; for example, colour human chromosome 2 pink. Then for each gene on human chromosome 2, find homologues on T. nigroviridis chromosomes, and colour them pink also. The large blocks of pink in T. nigroviridis chromosome 2 indicate long blocks that are syntenic with human chromosome 2. Of course, some of human chromosome 2 appears elsewhere in the T. nigroviridis karyotype; for instance, at the top of T. nigroviridis chromosome 3. Figure 7.5b is the reciprocal map: colour the T. nigroviridis chromosomes a solid colour, and map them onto the human set.

It cannot be seen in Figure 7.5 directly, but typically one human region aligns with two regions in T. nigroviridis. The explanation is whole-genome duplication in T. nigroviridis but not in human. Figure 7.6 shows this in more detail. Here Hsa = Homo sapiens; human chromosomes are numbered
[Figure 7.6 image: human chromosomes Hsa1–Hsa22 and HsaX divided into labelled synteny blocks, with keys to ancestral chromosomes (AncA–AncL, plus AncU, AncV, AncW, AncZ) and to T. nigroviridis chromosomes Tni1–Tni21. Two enlarged insets are labelled Tni13/Hsa16/Tni5 and Tni1/HsaX/Tni7.]
Figure 7.6 More-detailed mapping of synteny blocks between T. nigroviridis and human chromosomes. The two ‘blown-up’ regions
show matches between individual genes. Notice that in general two T. nigroviridis regions map to one human one, evidence for whole
genome duplication. In the detailed regions there is an alternation of matches, arising from random loss of one copy of each pair of
genes produced by the whole genome duplication. Hsa = Homo sapiens; human chromosomes are numbered Hsa1–Hsa22 plus HsaX.
Tni = T. nigroviridis. Anc = ancestral vertebrate. (Blocks AncU, AncV, AncW, and AncZ contain small amounts of sequence that could
not be assigned to the twelve ancestral chromosomes.)
From: Jaillon, O., Aury, J.M., Brunet, F., Petit, J.L., Stange-Thomann, N., Mauceli, E., et al. (2004). Genome duplication in the teleost fish Tetraodon
nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957.
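The alternation of matches described in the caption to Figure 7.6 can be imitated with a toy
simulation. The Python sketch below is illustrative only: the gene names, the 50% chance of losing
either copy, and the function names duplicate_and_lose and match_pattern are assumptions made for
this example, not part of the published analysis.

```python
import random

def duplicate_and_lose(ancestral_genes, seed=0):
    """Toy model: duplicate a chromosome, then keep one random copy of each gene."""
    rng = random.Random(seed)
    copy_a, copy_b = [], []
    for gene in ancestral_genes:
        # After duplication each gene sits on both copies; one copy is lost at random.
        if rng.random() < 0.5:
            copy_a.append(gene)
        else:
            copy_b.append(gene)
    return copy_a, copy_b

def match_pattern(reference_order, copy_a, copy_b):
    """For each gene in the unduplicated reference, report which copy retains it."""
    in_a = set(copy_a)
    return "".join("A" if gene in in_a else "B" for gene in reference_order)

# Hypothetical region of an unduplicated genome (standing in for a human region).
human_region = [f"g{i}" for i in range(1, 21)]
fish_a, fish_b = duplicate_and_lose(human_region, seed=1)

# A string of A's and B's: the single region maps to two regions in alternation.
print(match_pattern(human_region, fish_a, fish_b))
```

Each gene of the single region is found on exactly one of the two duplicated regions, and gene order
is preserved within each, which is the signature the figure illustrates.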
the minimal set of rearrangements that accounts for the entire pattern. Reversing those
rearrangements, on paper of course, provides a sketch of the ancestral vertebrate chromosomes (see
Figure 7.6). The suggestion is that the ancestral vertebrate genome was distributed on 12
chromosomes, and had the repertoire of ∼20 000–30 000 protein-coding genes that are typical of
extant vertebrates.
• The pufferfish is a vertebrate. At least three-quarters of its genes have human homologues. By
comparison of its genome with humans, it is possible to reconstruct the ancestral vertebrate
karyotype.
The chicken genome
The chicken genome was the first complete sequence of a bird. It is a useful outgroup for study of
the comparative genomics of mammals. The lineages leading to birds and mammals diverged about
310 million years ago.
The chicken is also important as an animal raised for food. The consumption of chickens in the UK
amounts to 2.5 million birds per year, and about 11 billion eggs. Chickens were domesticated from
the grey jungle fowl (G. sonneratii) in Asia, about 8000 years ago. Consequently, the genetics of
chickens has been studied extensively. There are very many different breeds; for example, some are
specialized for egg production, others for meat. It is anticipated that the genome will have
practical applications in food production.
The chicken has also been a popular laboratory animal. It has contributed to research in
developmental biology, virology, immunology, and cancer.
At ∼1.2 Gbp, the chicken genome is substantially smaller than most mammalian genomes. However, it
contains approximately the same number of genes. The chicken has 38 autosomes and one pair of sex
chromosomes, called Z and W. Like other birds, but different from mammals, females are
heterogametic (ZW) and males are homogametic (ZZ). About 90% of the 1.05 Gb of assembled sequence
was anchored to its proper chromosome location. It is interesting that the synteny between human
and chicken is more conserved than between human and mouse.
Why is the chicken genome smaller than the human and other mammalian genomes? The chicken genome
is relatively poor in interspersed repeats and pseudogenes. Expansion of many gene families is
greater in mammalian genomes.
To try to extract a ‘core’ set of vertebrate proteins, comparison of human, pufferfish (Takifugu
rubripes), and chicken genomes exposed a common gene set. Approximately 7000 protein-coding genes
from chicken have orthologues in both pufferfish and human. These are likely to implement common
necessary functions, and one expects to find them in most higher vertebrates. These common genes
are expressed in many different tissues. This is another typical signature of a gene that is not
rapidly evolving for a lineage-specific function.
The next step in this line of enquiry would be to determine which of these genes are also
expressed in primitive vertebrates and invertebrates. This will define a higher-vertebrate common
core gene repertoire.
Comparing chicken and human only, about 60% of the ∼23 000 chicken protein-coding genes have
unique human homologues. Whereas such pairs between human and mouse or rat show an average of
88% sequence conservation, between human and chicken this drops to 75.3%. Proteins with different
classes of function differ in conservation with human homologues: transport proteins are more
highly conserved than average, and proteins of the immune response show only 60% sequence
conservation.
The chicken has some avian-specific proteins. These include one family of keratins, which in
chickens form feathers; mammals have expanded a different family, to form hair. Chickens have
genes for avidin, a protein appearing in the egg whites of reptiles, amphibians, and birds. The
very strong binding of biotin to avidin (KD ≈ 10⁻¹⁵ M) has been applied in the laboratory for
purification. What is its natural function? It is thought to protect eggs from bacteria, which
require free biotin as a cofactor in numerous reactions.
Conversely, some human genes that chickens lack are:
• milk proteins such as casein
• enamel proteins, associated with loss of teeth in the bird lineage subsequent to Archaeopteryx,
a primitive bird that did have teeth
• vomeronasal receptors. The vomeronasal system is a secondary chemosensory or odour detection
system, that appears in many vertebrates, including humans, fish, reptiles, and others. Its
absence from chicken and other birds signifies a loss in the avian lineage, rather than an
invention in the mammalian one.
The platypus genome (Ornithorhynchus anatinus)
Extant mammals form a class divided into three orders:
monotremes: only the platypus and two species of echidna
marsupials: kangaroos, opossums, koalas, and many others, including all mammals native to
Australia and New Guinea
placentals: all other mammals, including humans.
The platypus (Ornithorhynchus anatinus) is as distant a relative as we have among mammals.
Startling to its discoverers, and to us now, is its mixture of mammalian and reptilian
characteristics. Like mammals, the platypus has hair, and nurses its young (it has mammary glands,
but not teats – the milk is released through localized pores in the skin, modified sweat glands).
Like reptiles it lays eggs; and has venom, delivered by males through ankle spurs.
Careful study of the anatomy revealed many other unusual characteristics. For instance, the name
monotreme (= single aperture) refers to the single orifice serving both the urogenital and
digestive systems.
An unusual sensory capacity is electroreception, the ability of the platypus to perceive
electrical impulses. The platypus can locate and catch prey through use of a combination of
mechano- and electroreceptors in its bill. A platypus will attack a battery immersed in water in
the dark.
Study of the molecular biology of the platypus revealed other surprises, including 10 sex
chromosomes (males are always XYXYXYXYXY). However, the sex determination system is closer to
that of birds than of most mammals. In marsupials and placental mammals, the primary locus for sex
determination is SRY, a gene on the Y chromosome that encodes a transcription factor. The platypus
lacks this gene.
The closest to an extant reptile–mammal transitional form that we have, the platypus has offered
us the chance to see how the basic distinctive features of mammals originated.
The platypus’s was the first monotreme genome sequenced. It has 2.2 billion base pairs. 18 527
protein-coding genes were identified. Not unexpectedly, a majority of these have orthologues in
opossum (a marsupial), human, dog, and mouse (placentals), and even chicken. Of particular
interest are genes not found in other mammals. Like its anatomy, the platypus genome shows a
mixture of mammalian and non-mammalian features. These include:
Odour receptors: the platypus odorant receptor genes are for the most part recognizable homologues
of those in other mammals. The repertoire is more akin to that of other mammals than to reptiles.
There are roughly half the number of odorant-receptor genes as in other mammals, but this may
possibly be a reflection of the animal’s aquatic lifestyle.
Milk: although true milk is unique to mammals, non-mammal animals that incubate their eggs secrete
fluids that protect eggs from desiccation and/or infection. However, unlike those primitive
precursors, platypus milk resembles that of other mammals. It is a complex mixture with both
nutritive and antimicrobial functions.
Eggs: unlike the eggs of marsupials and placental mammals, which are nourished internally, the
platypus lays eggs that contain yolk. Common to the yolks of eggs of fish, amphibians, reptiles,
birds, and most invertebrates is the protein vitellogenin. Vitellogenin is the precursor of the
lipoproteins and phosphoproteins that are major protein components of egg yolk. However,
vitellogenins are not restricted to eggs. For instance, bees use it as food store also.
Vitellogenin genes were lost in the lineages leading to marsupials and placental mammals, and
retained in the monotremes. In other mammals, the placenta became the locus of embryonic
development, and the mother supplied nutrients. Monotremes have a primitive form of placenta,
called a yolk-sac placenta.
Venom: venom is one of those ideas that has proved useful to a variety of species, including
monotremes
[Figure 7.7: evolutionary tree. Labels: vDLPs; therian β-defensins; vCrotasins; vCLPs; β-defensin
lineages 1–6.]
Figure 7.7 Evolutionary tree and points of gene duplication of defensins in birds, reptiles, platypus, and therians (= marsupial + placental
mammal). Defensins are a group of families of small proteins found in a variety of vertebrate and invertebrate species. The therian
molecules are not components of venom. They have antibacterial activity, generally functioning by forming pores within the microbial
cell membrane, allowing cell contents to leak out.
A class of venom defensin-like proteins (vDLPs) has arisen independently in several lineages, including reptiles and platypus. In
Crotalus snake venoms, the vDLPs are neurotoxins affecting voltage-gated sodium channels. The mechanism of action of the vDLP
in platypus venom is still unknown.
From: Warren, W.C., Hillier, L.W., Graves, J.A.M., Birney, E., Ponting, C.P., et al. (2008). Genome analysis of the platypus reveals unique signatures of
evolution. Nature 453, 175–183.
[Figure 7.8: tree labels. Euarchontoglires, comprising Primates (human, chimp), Rodents (mouse,
rat), and Lagomorphs (rabbit); Laurasiatheria (dog, cat, horse, cow, whale); Xenarthra (armadillo,
anteater); Afrotheria (aardvark); Marsupials (kangaroo); Monotremes (platypus).]
Figure 7.8 Phylogeny of mammals, showing monotremes and marsupials (green) and the four major
groups of eutherian mammals: Euarchontoglires, Laurasiatheria, Xenarthra, and Afrotheria (blue).
Human, chimpanzee, mouse, and rat are all Euarchontoglires. Dogs belong to the Laurasiatheria.
Complete genome sequences are known for species shown in red.
Were those not sufficient reasons for interest in the dog genome, the biology of the dog presents
numerous scientific challenges and opportunities.
• The dog is an outgroup of other mammals for which complete genome sequences have been
determined (see Figure 7.8).
• Dogs are an ideal species in which to study domestication. To a far greater extent than other
genera, dogs and their relatives offer both a variety of inbred populations – the different
breeds – and corresponding wild populations. The genomes of dogs and wolves are much closer than
those of humans and chimpanzees. The sequence divergence in chromosomal DNA between wolves and
dogs is 0.04% in exons and 0.21% in introns. Unlike humans and chimpanzees, dogs and wolves can
interbreed.
Romulus and Remus, founders of Rome, were, according to tradition, suckled by a wolf.
• Dogs share many human genetic diseases. Many are specific to individual breeds, and good
genealogical and clinical records are available. The breeds are highly inbred: many have small
founder populations and some have gone through bottlenecks. This simplifies the search for the
gene or genes responsible for the disease.
• Because many drugs are tested in dogs, understanding of their molecular biology is useful. Dogs
have also been used in research on gene therapy.
• Dogs show a vast morphological variation, notably in size. Information about genetic regulation
of developmental pathways is implicit in the comparative genomics of different breeds. For
instance, a single mutation controls breadth of skull and shortness of face. In humans, mutation
in the homologous protein is responsible for Treacher Collins syndrome, a developmental disorder
affecting the skull and face.
Different breeds also vary generally in personality traits, providing an opportunity to identify
genes for aggressiveness and passivity.
History of the dog
The order Carnivora, to which domestic dogs and cats belong, originated during the Palaeocene,
∼60 million years ago (see Figure 7.9). Dog-like carnivores are known from fossils from 40 million
years ago. The current closely related species – wolf, coyote, jackal, and red fox – split off
about 3–4 million years ago. The wolf lineage gave rise to the domesticated dog, Canis familiaris.
Domestication of dogs is recorded in archaeological artefacts 14 000–15 000 years old, but
probably took place much earlier. The first colonists of North America, who came across the Bering
Strait about 20 000–15 000 years ago, brought domesticated dogs with them.
Evidence from the genome suggests that dogs went through two population bottlenecks. The first
occurred ∼9000 generations ago (∼27 000 years) upon domestication. The second, ∼30–90 generations
ago, signals the origin of breed divergence. There are now about 300–1000 breeds of dogs. The
American Kennel Club recognizes 150 as genetically separated populations, with closed gene pools.
Genome variation among breeds of dogs
The most complete canine genome is that of Tasha, a female boxer. Her genome was determined by the
shotgun method, with 31.5 million reads providing ∼7.5-fold coverage (Table 7.3).
The dog genome is slightly smaller than that of humans, in part because dogs have fewer repeat
[Figure 7.9: phylogenetic tree of the Carnivora. Labels: Feliformia (Nandinia, Felidae, Viverridae,
Hyaenidae, Herpestidae, Malagasy carnivorans); Caniformia (Canidae, Ursidae, Phocidae, Odobenidae,
Ailurus, Mephitidae, Procyonidae, basal/other mustelids, Martes group, Mustela, Lutrinae), with
Musteloidea and Mustelidae marked as groups.]
Figure 7.9 Domestic dogs and cats fall into the two main suborders of the order Carnivora. The two lineages split about 48 million
years ago. Note that the pictures of the animals are not drawn to scale.
© 2005 From: Flynn, J.J., Finarelli, J.A., Zehr, S., Hsu, J., & Nedbal, M. (2005). Molecular phylogeny of the Carnivora (Mammalia): assessing the
impact of increased sampling on resolving enigmatic relationships. Syst. Biol., 54, 317–337. Reproduced by permission of Taylor & Francis Group, LLC
(http://www.taylorandfrancis.com).
sequences. The short, interspersed element (SINE) is shorter in dogs than in humans (1 500 000
copies of SINEs make up 13% of the human genome).
In addition to determining the reference sequence, 2.5 million single-nucleotide polymorphisms
(SNPs) were determined from 11 breeds of dogs. Given the inbred nature of the individual breeds,
it is not surprising that fewer SNPs appear when comparing individuals within breeds than in
comparisons amongst different breeds. Comparing individual boxers, there is ∼1 SNP/1600 bases.
Between breeds, there is ∼1 SNP/900 bases.
Table 7.3 The dog genome
Feature                          Value           Comment
Number of chromosomes            39 pairs        More than humans
Genome length                    2.4 × 10⁹ bp    Slightly less than humans
Number of proteins identified    37 774
Concomitant consequences of closer relationships within breeds are:
• Greater interbreed than intrabreed sequence difference. Over the entire species, dogs and humans
show similar levels of nucleotide diversity between individuals: a frequency of different bases of
∼8 × 10⁻⁴. However, the genetic homogeneity is much greater within breeds of dogs than within
distinct human populations.
• Longer haplotype blocks. Within breeds, haplotype blocks may be as long as 100 kb. Haplotype
blocks shared by different breeds are about 10 kb long. In comparison, the length of haplotypes in
modern humans is about 20 kb.
• Linkage disequilibrium within breeds extends over several megabases. Across all breeds it is
greatly reduced, extending only over tens of kilobases.
Comparison of dog, human, and mouse genomes
Dogs have 39 pairs of chromosomes compared with 23 pairs in human and 20 in mouse. Therefore,
the human and mouse chromosomes must have been reassorted to make up the dog karyotype.
Nevertheless, 94% of the dog genome appears in conserved synteny blocks with the human and mouse
genomes.
Approximately 5% of the dog genome constitutes functional elements common to dog, human, and
mouse. This is higher than the protein-coding fraction of the human genome. It includes regulatory
elements and non-protein-coding RNAs, and further suggests that it is premature to dismiss as
‘junk’ the regions of the genome to which we cannot yet assign function.
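The SNP densities quoted above for boxers and for comparisons between breeds can be put on the same
scale as the species-wide nucleotide diversity. The following Python lines are a small arithmetic
sketch; the function and variable names are chosen for this example only, and the three input
figures are the ones given in the text.

```python
def per_base(bases_per_snp):
    """Convert '1 SNP per N bases' into a per-base frequency."""
    return 1.0 / bases_per_snp

def bases_per_difference(rate):
    """Convert a per-base frequency into the average spacing between differences."""
    return 1.0 / rate

within_boxers = per_base(1600)    # comparing individual boxers
between_breeds = per_base(900)    # comparing different breeds
species_diversity = 8e-4          # nucleotide diversity over the entire species

print(f"within boxers:  {within_boxers:.2e} differences per base")
print(f"between breeds: {between_breeds:.2e} differences per base")
print(f"species-wide:   one difference per {bases_per_difference(species_diversity):.0f} bases")
print(f"between/within ratio: {between_breeds / within_boxers:.2f}")
```

On these figures the species-wide value falls between the within-breed and between-breed densities.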
Palaeosequencing – ancient DNA
Recovery of DNA from ancient samples
The recovery and sequencing of DNA from extinct species offers us a window onto evolutionary
history. Source material includes fossils collected from their deposits, mummified samples,
eggshells in the subfossil state (it was even possible, by testing DNA from the outer surface of
moa shells, to show that male birds were responsible for incubating the eggs), preserved seeds or
other plant material, specimens from museums, and clinical collections of pathogens. For instance,
there are repositories of samples of influenza virus dating back almost a century. Although often
only minuscule amounts of material are available, PCR amplification can produce reasonable
quantities for sequencing.
Like other biological material, DNA degrades after death, unless it is preserved. The best
protection is to exclude liquid water. This can occur if the samples are frozen, or desiccated by
heat, or if sequestered within compartments such as teeth, bone, or hair.
Because DNA from ancient samples is usually present in only microscopic quantities, contamination
is a serious danger. Contaminants can be microbial, or human from the scientists handling the
specimens. Even without contamination, samples suffer from fragmentation, and from chemical change
resulting in sequence changes. The most common is deamination of cytosine to uracil.
However, advances in isolation methods can reduce further damage during the extraction phase, and
careful technique can exclude contamination by scientist DNA. Paleosequencers must ‘rough it’
during field work, but apply unusually painstaking care in handling samples. A speciality
practised by few, but of interest to many.
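The effect of cytosine deamination on what a sequencer reports can be illustrated with a toy
simulation. Uracil pairs like thymine, so a deaminated C is read as T. The Python sketch below is
an illustration under stated assumptions: the template sequence, the damage rate of 0.3, and the
function names deaminate and substitutions are all made up for this example and do not come from
any published protocol.

```python
import random

def deaminate(template, rate, seed=0):
    """Return the sequence as it would be read after random C-to-U damage."""
    rng = random.Random(seed)
    read = []
    for base in template:
        # A deaminated cytosine (uracil) is copied and read as thymine.
        if base == "C" and rng.random() < rate:
            read.append("T")
        else:
            read.append(base)
    return "".join(read)

def substitutions(original, observed):
    """List (position, original base, observed base) for every mismatch."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(original, observed)) if a != b]

template = "ACGTCCGATCGACCTAGC"   # made-up sequence
damaged = deaminate(template, rate=0.3, seed=42)
print(damaged)
print(substitutions(template, damaged))
```

Every mismatch in the output is a C read as T. Damage on the complementary strand would appear as
G read as A, which this single-strand toy does not model.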
Figure 7.12 Reconstructions, classification, and estimated geographical distributions of species of moa, extinct flightless birds from
New Zealand.
From: Bunce, M., Worthy, T.H., Phillips, M.J., Holdaway, R.N., et al. (2009). The evolutionary history of the extinct ratite moa and New Zealand
Neogene paleogeography. Proc. Nat. Acad. Sci. USA 106, 20 646–20 651. Copyright (2009) National Academy of Sciences, USA.
The data on divergence of sequences is consistent with the historical geology. The suggested
scenario is that the divergence of the major groups took place on the South Island, after the alps
formed. When the land links arose, during glaciations, birds began to inhabit the North Island,
taking advantage of the new surroundings to generate a new round of divergence (Figure 7.12).
The dodo and the solitaire
The dodo (Raphus cucullatus) was a large, flightless bird that inhabited the island of Mauritius,
in the Indian Ocean, east of Madagascar (Figure 7.13). It was a large, robust bird, about a metre
in height and weighing about 20 kg (about three times the size of a typical Thanksgiving turkey).
It is common for island species to be either significantly larger, or significantly smaller than
their mainland counterparts. Such substantial changes can obscure taxonomic positions.
Sailors stopping at the island found the dodo easy prey – it could not fly, and lacked appropriate
fear of human predators. As a result, the dodo became extinct. The last survivor was shot in 1681.
The solitaire (Pezophaps solitaria) was a related bird from a neighbouring island east of
Mauritius, Rodrigues. It also became extinct, outliving the dodo by perhaps a century.
Even museum specimens of the dodo are rare. The Oxford University Museum of Natural History had
one. It was seen by don Charles Dodgson, who used it as a character in Alice in Wonderland. Only
partially saved from a fire during a tidying-up exercise, the Oxford specimen is the only known
source of soft tissues from a dodo. What remains now comprises a head, and a leg and foot, each
with some skin attached. There are many bones on the island, but conditions in the tropics are not
conducive to preservation of DNA. It was the Oxford remnants that provided the material for DNA
sequencing.
From the Oxford sample of the dodo, and samples from Rodrigues of the solitaire, it was possible
to amplify and sequence short overlapping fragments between 120 and 180 bp. For comparison,
corresponding sequences were analysed from many extant species of putative relatives, including
various species of pigeons and doves. From the extant species, sequences were determined of 1.4 kb
of mitochondrial DNA, comprising regions of the genes for 12S ribosomal RNA (360 bp) and for
cytochrome b (1050 bp).
A phylogenetic tree constructed from these data fixed the taxonomic position of the dodo. It is a
pigeon, of the family Columbidae. The closest extant relative of the dodo and solitaire is the
Nicobar pigeon (Caloenas nicobarica), which lives in Southeast Asia.
Figure 7.13 Mauritius dodo.
From: ‘A German Menagerie Being a Folio Collection of 1100 Illustrations of Mammals and Birds’ by
Edouard Poppig, 1841.
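The dodo regions were recovered as short overlapping fragments of 120 to 180 bp. To get a feeling
for how many such fragments a region requires, the Python sketch below counts the minimum number
of equal-length fragments needed to tile a target. It is a rough illustration: the 20 bp overlap
between neighbouring fragments is an assumed value, not one reported in the text, and the function
name fragments_needed is invented for this example.

```python
import math

def fragments_needed(target_len, fragment_len, overlap):
    """Minimum number of equal-length fragments that tile target_len with the given overlap."""
    if fragment_len <= overlap:
        raise ValueError("fragment must be longer than the overlap")
    if target_len <= fragment_len:
        return 1
    step = fragment_len - overlap  # new bases contributed by each additional fragment
    return 1 + math.ceil((target_len - fragment_len) / step)

ASSUMED_OVERLAP = 20  # hypothetical overlap in bp

for region, length in [("12S rRNA region", 360), ("cytochrome b region", 1050)]:
    for fragment_len in (120, 180):
        n = fragments_needed(length, fragment_len, ASSUMED_OVERLAP)
        print(f"{region} ({length} bp), {fragment_len} bp fragments: {n}")
```

Shorter fragments or larger overlaps raise the count, which is one reason degraded samples demand
many separate amplifications.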
High-throughput sequencing of mammoth DNA
Mammoths are unusual among extinct organisms in the favourable conditions of their Arctic habitats
for preservation of DNA. Even better than most specimens was a ∼28 000-year-old jawbone, found on
the shore of Baikura-turku, a bay extending from the south-eastern corner of Lake Taymyr.
Extraction from 1 g of bone yielded ∼0.73 mg DNA.
Fragments of the DNA attached to small sepharose beads were amplified in lipid vesicles by PCR.
Six runs of a Roche/454 Life Sciences Genome Sequencer 20 System produced a total of 1 943 593
reads, with average length of ∼95 bp.
The DNA in the sample contained about 50% mammoth DNA, a mixture of nuclear and mitochondrial; the
rest was bacterial contaminant. Using reference sequences to sort out mammoth sequences from
bacterial sequences, and mammoth nuclear sequences from mitochondrial ones, the aggregate harvest
was ∼95 Mb of mammoth sequence. This included 7.3-fold coverage of the 16 770 bp mitochondrial
DNA. The remainder gave a partial view of the mammoth nuclear genome (∼3%).
In addition to what it tells us about the mammoth, the significance of this work is its
demonstration of the power of (1) the latest instrumentation and (2) resequencing in determining
the organelle genome. There is no need for selective amplification of the target sequence.
Instead, it is possible to assemble the mitochondrial DNA from the very large quantity of data
available, given the scaffolding available from a related sequence, in this case that of the
Indian elephant. Note that, with over sevenfold coverage, the reference sequence is not really
needed for the assembly, but rather for identifying the reads corresponding to mitochondrial DNA.
Indeed, computational experiments have shown that a reasonably good assembly is possible using as
a reference sequence the mitochondrial DNA of the dugong, a distant relative. Ancestors of
mammoths and dugongs diverged around 65–70 million years ago. Mammoth and dugong mitochondrial DNA
sequences are only 75.3% identical. In practice, a reference sequence from a closer relative than
the dugong is to mammoth would in most cases be available. Therefore the dugong–mammoth assembly
offers a ‘worst-case’ analysis.
For the sake of argument, suppose one’s interest were limited to the mitochondrial DNA sequence.
One might choose to amplify the mitochondrial DNA and sequence that only. Sequencing fragments
from all of the DNA is, comparatively, very inefficient in its use of the data produced, as far as
determining the mitochondrial sequence is concerned. However, comparisons with other ancient-DNA
sequencing projects suggest that it is efficient in terms of the amount of precious sample used.
For the study of extinct species, this is an overriding consideration.
The mammoth nuclear genome
The mammoth nuclear genome presented a harder problem. It is estimated to be approximately
4.17 Gbp in length, longer than the human genome.
DNA was extracted from hair samples from a Siberian animal, denoted M4, that died about 20 000
years ago. It is interesting that because the DNA is fragmented, the relatively short read lengths
of the sequencer were not a great problem. The average read length produced was about 150 bp. This
yielded 3.6 Gb of sequence. Combining this with additional sequences determined from other
individuals produced a total of 4.17 Gb of sequence. Calibration of error rates suggests an
average of 6 errors out of 10 000 bases arising from DNA damage, and 8/10 000 from sequencing.
The genome of the African elephant (Loxodonta africana) provided a reference sequence. The
L. africana genome had been sequenced at 7× coverage, and assembled. Alignment of the reads from
the mammoth samples, to L. africana and to other genomes representing potential contaminants,
showed that over 90% of the reads were mammoth DNA, for a total of 3.3 Gb of mammoth sequence.
It was possible to compare amino acid sequences of mammoth proteins with the orthologues in
elephants and other species. The results suggest that mammoth and African elephant differ, on
average, in one residue per protein. It is difficult to assign selective or even functional
significance to these, in general. Even in the cases of residues unique to mammoth, compared to a
wide spectrum of other placental mammals, the sequence is only the starting point for
investigation of the proteins thereby identified as interesting candidates for follow-up studies.
The phylogeny of elephants
Access to DNA from extinct species has allowed resolution of two problems in elephant phylogeny:
(a) How many species of African elephants are there?
There are two populations of elephants in Africa, living in the savannah and in the forest. Some
authorities have described them as separate species: Loxodonta africana (savannah) and L. cyclotis
(forest). Others have considered them a single species, or regard L. cyclotis as a subspecies of
L. africana.
(b) Are mammoths more closely related to African elephants or Indian elephants (Elephas maximus)?
species! Nevertheless, the conclusion is that mammoths are more closely related to Asian elephants
than to African elephants, and that L. africana and L. cyclotis should be considered as separate
species.
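The sorting of mammoth reads described above, in which each read is assigned to a nuclear,
mitochondrial, or contaminant reference, can be sketched with a toy classifier based on shared
k-mers. This is a hedged illustration only: real projects use alignment programs, and the short
reference strings, the labels, the value k = 8, and the function names kmers and classify_read are
all invented for this example.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, references, k=8, min_shared=1):
    """Assign a read to the reference sharing the most k-mers, or 'unassigned'."""
    read_kmers = kmers(read, k)
    best_label, best_shared = "unassigned", 0
    for label, ref_kmers in references.items():
        shared = len(read_kmers & ref_kmers)
        if shared > best_shared:
            best_label, best_shared = label, shared
    return best_label if best_shared >= min_shared else "unassigned"

# Made-up reference fragments standing in for real genomes.
refs_seq = {
    "elephant_nuclear": "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTT",
    "elephant_mito": "GTTAATGTAGCTTAAACTTAAAGCAAGGCACTGAAAATGC",
    "bacterial": "TTGACCGGTATCGCCAAGGTGCTGCGTGAACTGGGCAAAG",
}
references = {label: kmers(seq, 8) for label, seq in refs_seq.items()}

reads = [
    refs_seq["elephant_mito"][5:25],
    refs_seq["elephant_nuclear"][10:30],
    refs_seq["bacterial"][0:20],
    "CCCCCCCCCCCCCCCCCCCC",  # matches none of the references
]
labels = [classify_read(read, references) for read in reads]
print(labels)
```

Reads labelled with a mitochondrial reference would then go forward to assembly of the organelle
genome, while reads matching contaminant references are set aside.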
● RECOMMENDED READING
• General discussions of topics in eukaryotic evolution, many in the form of collections of papers:
Hirt, R.P. & Horner, D.S., eds. (2004). Organelles, Genomes and Eukaryote Phylogeny: An
Evolutionary Synthesis in the Age of Genomics. CRC Press, Boca Raton, FL, USA.
• On June 29, 2006, The Royal Society held a discussion meeting, Major steps in cell evolution:
palaeontological, molecular, and cellular evidence of their timing and global effects. The meeting
was organized by T. Cavalier-Smith, M. Brasier, and T.M. Embley, and published in Philosophical
Transactions of the Royal Society B, volume 361, issue 1470.
Katz, L.A. & Bhattacharya, D. (2006). Genomics and Evolution of Microbial Eukaryotes. Oxford
University Press, Oxford.
Baldauf, S.L. (2008). An overview of the phylogeny and diversity of eukaryotes. Journal of
Systematics and Evolution 46, 263–273.
Telford, M.J. & Littlewood, D.T.J., eds. (2009). Animal Evolution / Genomes, Fossils and Trees.
Oxford University Press, Oxford.
• An atlas of life forms, showing phylogenetic relationships and dates of divergence:
Hedges, S.B. & Kumar, S. (2009). The Timetree of Life. Oxford University Press, Oxford.
Exercises, problems, and weblems
Exercises
Exercise 7.1 What fraction of the intergenic space in the nuclear genome of A. thaliana is
occupied by transposons?
Exercise 7.2 On a photocopy of Figure 7.2, mark the approximate dates of the duplications in the
A. thaliana genome on the branch leading to eudicots.
Exercise 7.3 Calculate the average gene density in the nuclear, mitochondrial, and chloroplast
genomes of A. thaliana.
Exercise 7.4 Describe how you would test the assertion that the last common ancestor of
urochordates acquired the cellulose synthase gene by lateral transfer from bacteria. What
sequence information would you gather, and how would you analyse it?
Exercise 7.5 In Figure 7.5(b), which T. nigroviridis chromosomes have substantial regions of
synteny with human chromosome 10?
Exercise 7.6 On a photocopy of Figure 7.5(a), indicate the regions in which T. nigroviridis
chromosomes 5 and 13 contribute to human chromosome 15.
Exercise 7.7 Figure 7.12 shows that specimens of Euryapteryx curtus appear on both North and
South Islands. On which island is it likely that the species arose? What reasoning leads you to this
conclusion?
Exercise 7.8 In the dog, the variation in mitochondrial DNA sequences is lower than the variation
in nuclear DNA sequences. What does this suggest about the breeding behaviour of domesticated
dogs?
Exercise 7.9 For which of the following domesticated species could a population survive if
released into the wild? Dog, cat, chicken, parakeet, maize, rice, and wheat.
Exercise 7.10 It is much easier to study mitochondrial DNA than nuclear – it is smaller, and more
abundant in cells. What is the danger of assigning phylogeny by comparing populations through
sequencing mitochondrial DNA of various individuals, in species that are matrilocal (that is, females
remain within a herd, males leave)? An example was a study of elephants, based on mitochondrial
DNA sequences. What criticism might be raised?
Problems
Problem 7.1 Figure 7.14 shows an alignment of globins from Ciona intestinalis and human
haemoglobin α and β, myoglobin, cytoglobin, and neuroglobin. Some N- and C-terminal
extensions have been trimmed. (a) Which pair of globins has the largest number of identical
residues in this alignment? (b) Are the Ciona globins more similar to one another than the human
globins are to one another? (c) Which human globin do the Ciona globins most resemble?
Figure 7.14 Alignment of globins from Ciona intestinalis and human haemoglobin α and β, myoglobin, cytoglobin, and neuroglobin.
Some N- and C-terminal extensions have been trimmed.
Weblems
Weblem 7.1 (a) Has at least one organism from each of the major eukaryote classes shown in
Figure 7.1 been the subject of a full-genome sequencing project? For each class, give an example
of such a species if possible. (b) For which eukaryotic phyla has at least one species been the subject
of a full-genome sequencing project? For each phylum, give an example of such a species if possible.
Weblem 7.2 Find a yeast protein in each of the first ten functional categories in Table 7.1.
Weblem 7.3 What is the latest common ancestor of the human and the aardvark? (Hint: compare
the full taxonomy listings in any entry of a human and an aardvark sequence.)
Weblem 7.4 The UniProtKB entry for chicken ovocleidin 116 is Q9PUT1_CHICK. This protein is
eggshell-specific in chickens. Does it have any mammalian homologues?
Weblem 7.5 Mammals excrete nitrogenous waste in the form of urea. Birds excrete uric acid. Urea
in mammals is formed by the urea cycle, a metabolic pathway that forms urea (H2NC(=O)NH2)
from ammonia and aspartate. The first enzyme in this pathway is carbamoyl phosphate synthetase
I, which catalyses the reaction:
2ATP + HCO3⁻ + NH4⁺ → 2ADP + H2NC(=O)OPO3²⁻ (carbamoyl phosphate) + Pi
(a) Does the chicken contain a homologue of human carbamoyl phosphate synthetase I?
(b) Is there reason to believe it is not functional? (c) In what tissues in chicken is it expressed?
(d) What might be the function of this enzyme in the chicken?
CHAPTER 8
LEARNING GOALS
• To understand the science underlying the use of genomics for personal identification.
• To recognize what characteristics of an unknown individual can be inferred from a sample
of blood or saliva, and that the use of these inferences in criminal investigation remains
controversial, with different jurisdictions adopting different regulations.
• To appreciate how mitochondrial DNA sequences were used to identify the remains of the
Russian royal family.
• To see the domestication of crop plants as experiments in directed genome change, and to
appreciate that we can now analyse the genetic differences between the wild progenitor and
the varieties used in contemporary farming.
• To consider the relationship between humans and Neanderthals, based on a sequencing of the
Neanderthal mitochondrial and nuclear genomes.
• To understand past patterns of human migration, as reflected in mitochondrial DNA haplotypes.
A prominent theme in our presentation of genomics has been the potential for applications that
improve the health of humans, animals, and plants. In this short chapter we collect a few
applications of genomics to some of the other human sciences.
238 8 Genomics and Human Biology
Legal applications of DNA sequencing depend on several scientific facts:
1. The genomes of all individuals except identical siblings are unique. Like fingerprints, genomes
provide a unique personal identification. A bloodstain at a crime scene, like a set of fingerprints,
can be traced to a specific individual.
2. The genome of every person combines chromosomes from his or her parents. Unlike fingerprints,
therefore, genomes can indicate familial relationships; notably, identification of paternity.
3. Each person’s genome contains genes that influence, even if they do not inevitably determine,
recognizable features, such as eye colour. In principle, DNA left by an unknown individual
at a crime scene could be analysed to suggest a physical description of the source individual.
4. Thus, unlike fingerprints, genomes contain much more information about a person than simple
identification. The treatment of this information by governmental authorities raises ethical and
legal questions. We have already raised some of these questions in Chapter 1.
[Figure 8.1 lane labels: M, F, C]
child’s DNA was a combination of those of the parents (see Figure 8.1).
Why do different individuals give different patterns of restriction fragment sizes? One possible
cause of the difference is a mutation in a restriction site, causing that site not to be cleaved.
In this case, two fragments from unmutated DNA will correspond to a single longer fragment from the
mutated sample. (In terms of our analogy between restriction maps and distances between consecutive
Starbucks cafes on Broadway in New York City, imagine the effect on the pattern if one of the cafes
were to close; see p. 90.) Alternatively, somewhere between two restriction sites there may be a
short repetitive stretch of DNA, the number of copies of which is unstable during replication, where
the polymerase ‘stutters’. Expansion of such a repeat will lengthen the restriction fragment in
which it appears. The fragment will occupy a different position on a gel that separates fragments
according to size.
Such a short repetitive segment of DNA is called a variable number tandem repeat (VNTR). VNTRs are
generally flanked by recognition sites for the same restriction enzyme, which will neatly excise
them, producing fragments of different lengths (see Box 8.1). These fragment lengths vary between
individuals, a phenomenon known as restriction fragment length polymorphism (RFLP). The fragments
can be separated on a gel according to size and detected by Southern blotting.
• VNTRs are characteristics of genome sequences; RFLPs are artificial mixtures of short stretches
of DNA created in the laboratory in order to identify VNTRs.
The patterns were easy to determine from a sample of DNA. Jeffreys and his co-workers quickly
established that they were unique to individuals, providing a ‘genetic fingerprint’.
The first legal application, in 1985, was to a case of disputed identification involving a family
of UK citizens. A child in the family visited Ghana. When he returned to the UK, immigration
authorities suspected him of being an impostor, not entitled to UK residency. None of the classical
blood tests, including A, B, AB, O, and other blood groups, and even MHC haplotyping – which gives
much higher discrimination – produced definitive results. Indeed, there was a possibility that the
boy was related to the woman who claimed to be his mother, but perhaps he was her nephew rather than
her son. Quite fine distinctions were therefore essential. Jeffreys’ DNA fingerprints, comparing
the patterns from the child’s DNA with those of members of the UK family, proved his identity to
the satisfaction of the Home Office. The family were reunited.
Jeffreys also applied his method to criminal identification. In the first case, DNA fingerprinting
proved the innocence of a suspect who had actually confessed to two crimes. The true criminal was
discovered after a survey of DNA samples from almost 4000 people living in the region. In this
single case, DNA fingerprinting proved both the innocence of a man under arrest, in serious danger
of conviction and punishment; and the guilt of the real criminal.
A substantial number of persons convicted and sent to jail before Jeffreys’ discovery have
subsequently been proved innocent by analysis of samples saved from the evidence presented at
their trial.
Despite its successes, identification by gel separation of RFLPs has disadvantages in practice.
It requires relatively large amounts of undegraded DNA (10–50 ng of material no shorter than
20 000–25 000 bp). Since the development of PCR, DNA-based identification methods have tested for
the presence of selected regions known to vary in the population, using PCR to amplify those
present. This greatly improves sensitivity. Subnanogram amounts suffice to identify 100 bp regions.
It is possible to get a positive identification from a single hair (of a person, or in one case of
the cat of a criminal’s parents), or from the saliva on a licked envelope.
The method in common use now is to PCR amplify a short tandem repeat (STR) typically containing
2–5 bp, repeated between a few and a dozen times. Amplification produces fragments about 200–500 bp
long. Loci in common use show 5–20 common alleles, and 8–15 loci are tested.
Jeffreys has recently introduced a newer identification method based on a single VNTR locus (see
Problem 8.1).
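The two causes of fragment-length differences described in this section (loss of a restriction site, and expansion of a tandem repeat between two sites) can be illustrated with a small sketch. The toy locus, the repeat unit CAT, and the function names `digest` and `make_locus` are assumptions made for illustration only; the one real detail used is that HaeIII recognizes GGCC and cuts between GG and CC.

```python
def digest(seq, site, cut_offset):
    """Return the fragment lengths produced by cutting seq at every occurrence of site.

    cut_offset is the position of the cut within the recognition site
    (2 for HaeIII, which cuts GG^CC).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def make_locus(n_repeats, second_site="GGCC"):
    """Hypothetical locus: a CAT repeat flanked by two HaeIII sites."""
    return ("ATATATATAT" + "GGCC" + "TT" + "CAT" * n_repeats
            + "TT" + second_site + "AAAAAA")


reference = digest(make_locus(3), "GGCC", 2)
site_lost = digest(make_locus(3, second_site="GGAC"), "GGCC", 2)
expanded = digest(make_locus(5), "GGCC", 2)

print("reference fragments:", reference)   # three fragments
print("second site mutated:", site_lost)   # two fragments merge into one longer one
print("repeat expanded    :", expanded)    # the middle fragment grows by 2 x 3 bp
```

On a gel, each list corresponds to a different band pattern: the mutated site removes two bands and adds one longer band, whereas repeat expansion shifts a single band.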
HaeIII 5′...GG↓CC...3′          HinfI 5′...G↓ANTC...3′
       3′...CC↑GG...5′                3′...CTNA↑G...5′
Mitochondrial DNA
Human mitochondrial DNA is 16 569 bp long.
It contains a hypervariable 100 bp region, which
varies by 1–2% between unrelated individuals. The mitochondrial DNA of unrelated people typically
differs at eight positions. Mitochondrial DNA is very abundant and survives very well. It was used
to identify the remains of the Russian royal family (see Box 8.2).
Gender identification
It is possible to decide whether a nuclear DNA sample came from a male or female. Obviously,
detection of any sequence unique to the Y chromosome will prove male origin. Another technique in
common use exploits the different versions of the gene for amelogenin on the X and Y chromosomes.
The X version contains a 6 bp deletion. PCR amplification of this region from a female will give
one band from the two identical X copies of the gene; DNA from a male will give two bands, one from
the X and one from the Y.
DNA identification has provided evidence in several very high-profile cases that readers will be
familiar with.
• The trial of O.J. Simpson for the murder of his former wife; he was acquitted despite
presentation of evidence by the prosecution that he was the source of fresh bloodstains found at
the scene of the crime.
• A stain on a White House intern’s dress provided evidence against US President William J. Clinton.
• Comparison of DNA from descendants of early 19th-century US President Thomas Jefferson and
Sally Hemings, a slave on his Virginia plantation, supported the conclusion that Jefferson
fathered at least one of Hemings’ children.
Less sensational, but more important in everyday law enforcement, is the fact that DNA evidence is
sufficiently definitive and widely accepted to avoid many trials, by not indicting innocent people.
Applications of DNA identification techniques to animals include the proof of claims that Dolly the
sheep was indeed a clone, testing of horses and dogs to confirm breeders’ claims of pedigrees,
testing of commercial whale meat to check for endangered species, and even a suggestion of creating
a database
BOX 8.2 Identification of the remains of the family of Tsar Nicholas II from analysis of
mitochondrial DNA
For most of us, all of our mitochondria are genetically identical, a condition called homoplasmy.
However, in some individuals, different mitochondria contain different DNA sequences; this is
called heteroplasmy. Such sequence variation in a disease gene in the mitochondrial genome can
complicate the observed inheritance pattern of the disease.
The most famous case of heteroplasmy involved Tsar Nicholas II of Russia. After the revolution in
1917, the Tsar and his family were taken into exile in Yekaterinburg in Central Russia. During the
night of 16–17 July 1918, the Tsar, Tsarina Alexandra, at least three of their five children,
their physician, and three servants who had accompanied the family were killed and their bodies
buried in a secret grave. When the remains were rediscovered, assembly of the bones and examination
of the dental work suggested – and sequence analysis confirmed – that the remains included an
expected family group. The identity of the remains of the Tsarina was proved by matching the
mitochondrial DNA sequence with that of a maternal relative, Prince Philip, Chancellor of the
University of Cambridge, Duke of Edinburgh – and grandnephew of the Tsarina. (Prince Philip’s
shared maternal line with Alexandra means that in principle his chances of suffering from
haemophilia were 12.5%.)
However, comparisons of mitochondrial DNA sequences of the putative remains of Nicholas II with
those of two maternal relatives revealed a difference at base 16 169: the Tsar had a C and the
relatives a T. Extreme political and even religious sensitivities mandated that no doubts were
tolerable. Further tests showed that the Tsar was heteroplasmic; T was a minor component of his
mitochondrial DNA at position 16 169. To confirm the identity beyond any reasonable question, the
body of Grand Duke Georgij, brother of the Tsar, was exhumed and was shown to have the same rare
heteroplasmy.
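The logic of the comparison in Box 8.2 can be sketched in a few lines: a comparison that considers only the major base at each position reports a mismatch at 16 169, whereas a comparison that takes the minor component into account does not. This is a minimal illustration, not the procedure used in the original study; the base fractions and the 5% detection threshold are hypothetical values chosen only to make the example concrete.

```python
def major_base(mix):
    """Return the most abundant base in a {base: fraction} mixture."""
    return max(mix, key=mix.get)


def naive_differences(a, b):
    """Positions at which the major bases of two individuals differ."""
    return [pos for pos in sorted(a.keys() & b.keys())
            if major_base(a[pos]) != major_base(b[pos])]


def heteroplasmy_aware_differences(a, b, min_fraction=0.05):
    """Positions at which the two individuals share no detectable base.

    min_fraction is an assumed detection threshold for minor components.
    """
    diffs = []
    for pos in sorted(a.keys() & b.keys()):
        bases_a = {base for base, f in a[pos].items() if f >= min_fraction}
        bases_b = {base for base, f in b[pos].items() if f >= min_fraction}
        if not bases_a & bases_b:
            diffs.append(pos)
    return diffs


# Position 16169: the Tsar carried mostly C with T as a minor component;
# the maternal relatives carried T. The fractions are hypothetical.
tsar = {16169: {"C": 0.7, "T": 0.3}}
relative = {16169: {"T": 1.0}}

print(naive_differences(tsar, relative))               # [16169]
print(heteroplasmy_aware_differences(tsar, relative))  # []
```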
to identify dogs whose owners do not clean up after them in municipal parks.
Physical characteristics
Suppose a sample containing DNA is collected at a crime scene, and there is reason to believe that
a criminal deposited it. It is possible to use the sample for identification. But suppose the
source individual is not represented in the forensic databanks, and is not one of the suspects –
usual or unusual – rounded up. It is still technically feasible to make some inferences about the
person the police are looking for.
It is possible to predict certain physical characteristics from analysis of DNA sequences. Gender,
obviously, but also colour of hair, eyes, and skin, and ethnic background. Use of these inferences
for suspect profiling is controversial.
In some cases, it is possible to infer the source individual’s family name! Oxford don Bryan Sykes
discovered that all males named Sykes in the UK are descendants of a single founder individual,
and all carry specific diagnostic features of their Y-chromosome sequences. Police could deduce,
from a sample left at a crime scene, whether the source individual was named Sykes or not (unless,
of course, he changed his name).
Other possible analysis of a blood sample left at a crime scene might provide an estimate of the
time of day of deposition. Certain chemicals, for instance melatonin, vary in concentration in
blood and saliva following regular circadian rhythms. Such tests do not involve DNA. DNA
methylation correlates with age. It might not be too fanciful to imagine a police investigator
asking:
Now then, Grandfather Sykes, where were you at 11 pm last night?
• In addition to matching a DNA sample with an individual, it is possible to analyse crime-scene
samples to infer several characteristics, including eye and hair colour, complexion, and
ethnicity. Use of these inferences in criminal investigation remains controversial, and there is
substantial variation in what different jurisdictions permit.
The domestication of crops
The transition from hunting/gathering to agriculture represents a major change in human activity,
diet, and social and economic organization. Domestications changed the biology of plants and
animals, and even of humans – for instance, the ability to digest lactose past infancy is
associated with domestication of cattle.
Many different plants were domesticated, in different regions around the world (Figure 8.2).
Although these domestications were independent events, there are many common features:
• Characteristics that improve the product:
– enlargement of fruit and/or seed.
– improved flavour and/or nutrition.
• Characteristics that facilitate harvesting:
– synchronization of ripening time.
– larger central stalks relative to side shoots – technically, increased apical dominance. For
instance, contemporary maize has a central stalk, with the ears growing at the tips of short
branches. The ancestral species, teosinte, was highly branched (see Figure 8.3). This allows
more plants per unit area tilled; it is analogous to building skyscrapers in cities.
– seeds do not fall off the plant (seed fall is called shattering). However, to facilitate
harvesting the link of the seed to the plant should be relatively weak. The loss of seed
dispersal can render the plant no longer viable in the wild.
• Tillering – shoots fill empty spaces between plants. This makes it unnecessary to plant seeds at
specific intervals.
• Increased self-pollination.
These favourable properties are the result of genetic changes during domestication. Documented
types of changes include: amino acid substitutions, deletion/truncations altering the functions of
individual proteins,
[Figure 8.2 map labels]
Pepo squash 10 000 B.P.
Maize 9000–7000 B.P.
Common bean 4000 B.P.
African rice 2000 B.P.
Pearl millet 3000 B.P.
Sorghum 4000 B.P.
Rice 8000 B.P.
Foxnut 8000 B.P.
Emmer wheat 10 000 B.P.
Einkorn wheat 10 000 B.P.
Barley 10 000 B.P.
Arrowroot 8000 B.P.
Yam (D. trifida) 6000 B.P.
Cotton 5000 B.P.
Sweet potato 4500 B.P.
Peanut?
Manioc 8000 B.P.
Yam (D. alata) 7000 B.P.?
Chile peppers 6000 B.P.
Banana 7000 B.P.
Taro 7000 B.P.?
Potato 7000 B.P.?
Quinoa 5000 B.P.
Figure 8.2 Populations in many regions of the world have domesticated plants. Dates shown are based on archaeological evidence,
not DNA sequence analysis. (B.P. = before present.)
From: Doebley, J.F., Gaut, B.S., & Smith, B.D. (2006). Cell 127, 1309–1321.
Figure 8.3 Comparison of modern maize with its teosinte progenitor (Z. mays subsp. parviglumis). (a) Teosinte grows many long,
tasselled branches. (b) In modern maize, many short branches bear the ears at their tips. (c) The kernels of teosinte are encapsulated
in a hard compartment. This picture shows both mature (left, dark) and immature (right) kernels. (d) Comparison of kernels of teosinte
and modern maize.
Sources: (a) From US Department of Agriculture, Natural Resources Conservation Service, Plant Materials Program, Plant Release Photo Gallery.
(b) Photo by David T. Webb, distributed by the Botanical Society of America. (c, d) Photographs by Hugh Iltis.
transposon insertion, regulatory changes, splice-site mutation, and gene duplications.
In general, domesticated plants show lower genetic diversity than their wild, progenitor species.
Causes of loss of genetic diversity include (a) population bottlenecks and/or (b) selective sweeps.
Comparisons with genes that are selectively neutral, with respect to domestication phenotypes,
reveal the relative importance of these two effects. Selectively neutral genes lose genetic
diversity only through bottlenecks.
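The comparison described above can be sketched numerically. One common measure of diversity is nucleotide diversity (the mean number of pairwise differences per site); using it here is an assumption, since the text does not name a specific statistic. The four-sequence alignments below are invented toy data, not measurements from any crop.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site among aligned sequences."""
    n_sites = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs)
    return total / (len(pairs) * n_sites)


def retained(wild, domesticated):
    """Fraction of the wild progenitor's diversity retained in the crop."""
    return nucleotide_diversity(domesticated) / nucleotide_diversity(wild)


# Hypothetical alignments (10 sites, 4 individuals each).
neutral_wild = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "AAAAAAATTT"]
neutral_dom = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAAAT", "AAAAAAAATT"]
selected_wild = ["CCCCCCCCCC", "CCCCCCCCCG", "CCCCCCCCGG", "CCCCCCCGGG"]
selected_dom = ["CCCCCCCCCC", "CCCCCCCCCC", "CCCCCCCCCC", "CCCCCCCCCG"]

bottleneck_only = retained(neutral_wild, neutral_dom)  # loss shared by all genes
with_sweep = retained(selected_wild, selected_dom)     # bottleneck plus selection

print(f"neutral gene retains  {bottleneck_only:.0%} of wild diversity")
print(f"selected gene retains {with_sweep:.0%} of wild diversity")
```

In this toy example the neutral gene sets the baseline loss attributable to the bottleneck; the additional loss at the selected gene is the part that would be attributed to a selective sweep.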
A very interesting question about the genetics of domestication is: to what extent were the genes
selected for in domestication present in progenitor populations, and to what extent are they
de novo mutations? Examples of both types are known. The large genetic variability in the
progenitor population makes it harder to find a rare allele even if it were present in the
original population.
• Studies of the genomes of crop plants have applications to agriculture, including the search for
varieties that produce yields improved in quality and quantity, require less fertilizer and
pesticides, and are resistant to disease. If genomes of wild progenitor species are also
available, it is possible to study the genomics of domestication.
Maize (Zea mays)
The cultivation of maize (Zea mays) supported the pre-Columbian civilizations of Central and South
America. Maize was brought to Europe in the 15th and 16th centuries and quickly spread around the
world. Maize is now the world’s third-largest crop plant, after rice and wheat. The annual harvest
amounts to 7 × 10¹¹ kg, raised on 3.3 × 10¹¹ km² of land. Maize is grown primarily as food, for
humans and animals, but some is converted into ethanol for fuel.
In the late 20th century, the intensive development of high-yielding crop varieties began. This
‘green revolution’, in addition to improving yields, bred maize for a higher content of lysine.
Corn was formerly lysine-poor. As a result, people with diets based primarily on maize were
traditionally lysine-deficient.
• In the USA and Canada, maize is called corn.
Maize is a domesticated form of a grass called teosinte. Several varieties of wild teosinte survive
in Mexico. Analyses of isozyme and microsatellite diversity – ‘paternity tests’ for species – have
identified the progenitor of maize as a teosinte subspecies, Z. mays subsp. parviglumis. A
combination of archaeological and molecular evidence suggests that teosinte was domesticated in
southern Mexico between 6000 and 9000 years ago. The earliest artefacts showing domesticated maize
are cobs found in a cave in Oaxaca, Mexico, dated 6250 years ago.
Rafael Guzmán, an undergraduate student at the Universidad de Guadalajara, discovered the teosinte
progenitor strain of maize during field work in southwestern Mexico in late 1978. Guzmán was
stimulated by a challenge contained in a New Year’s card from botanist Hugh Iltis. The importance
of his discovery for maize science cannot be overestimated. Access to the progenitor strain makes
possible detailed comparisons of sequences. Maize and teosinte are still interfertile, permitting
the reintroduction of specific alleles. The teosinte that Guzmán discovered contains unique
virus-resistance genes that have been bred into the maize used in agriculture.
Notable differences between teosinte and modern maize include:
• Teosinte has many long branches tipped by tassels. The tassels correspond to male flowers. Modern
maize has a single main stalk, with the tassel at the top. The many short lateral branches bear
the ears at their tips. The ears, containing the seeds, of course develop from female flowers.
(Separate male and female flowers are a feature of varieties of teosintes and maize, not shared
by other grasses.)
• The teosinte ear contains 5–12 kernels. Their structure adapts them for dispersal by passage
through the digestive tracts of birds and mammals. Individual seeds grow inside an encasing glume
hardened by silica and lignin (see Figure 8.3c). For human consumption, the kernels would have to
be ground, or ‘popped’ by heating. At maturity, teosinte seeds separate spontaneously, another aid
to dispersal. In contrast, an ear of modern maize has several hundred kernels, with no hard
casing, and the kernels remain fixed to the ear (see Box 8.3).
These very large phenotypic differences between teosinte and maize appear in plants that have very
similar genomes. This presents a paradox. Its resolution must emerge from study of the genomes, the
proteins, and the expression patterns. How did the genome change upon domestication? How many
genes were involved? Which genes? How have the changes in the genome produced the phenotypic
effects?
expressed. For tb1, the alleles selected for domestication are present in wild teosinte.
• Another gene that differs between teosinte and maize is tga1 (tga = teosinte glume architecture).
This gene affects the structure of the glume, the surroundings of the seed. The teosinte allele
produces the hard seed case. The maize allele reduces the glume to a soft membrane underneath the
kernels. For tga1, the expression level is similar in teosinte and maize. However, there is a
specific single nucleotide change in one exon, substituting a lysine in the teosinte protein with
an asparagine in maize. The maize allele has not been found in wild teosintes.
Can we know whether these and other genetic differences appeared at the time of domestication?
It has been possible to sequence DNA from cobs found in sites of a variety of ages, the oldest
dated at 4400 years ago. The 4400-year-old cob shows modern-maize alleles for three genes tested,
including tb1, showing that all three were present in the maize population at least that long ago.
However, teosinte-specific alleles of some genes appear in 2000-year-old cobs from New Mexico, in
the southwest USA. Selection was incomplete then, at least in regions distant from the site of
original cultivation.
Aromatic: Iran, Pakistan, India, Nepal; including Basmati.
What was the course of rice domestication? Was there a sequential process:
O. rufipogon → O. sativa japonica → O. sativa indica
(or rufipogon → indica → japonica)?
These simple models do not explain the observations that japonica and indica share some
domestication alleles, but not others. The currently accepted model involves originally separate
O. rufipogon populations, separately domesticated to O. sativa japonica and O. sativa indica,
followed by genetic exchange and subsequent specialization.
• Do not confuse the wild progenitor Oryza rufipogon of domesticated rice O. sativa with the
vegetable known as ‘wild rice’, now grown primarily in the northern United States, Zizania
palustris and certain other Zizania species. These are not close relatives of Oryza.
Golden rice
Figure 8.4 (a) Theobroma cacao tree showing fruit. (b) A ripe pod split open,
showing the beans. To create chocolate from beans, the seeds are fermented
and dried. Processing extracts cocoa butter (triacylglycerols) and cocoa powder
(proteins and polysaccharides, plus small molecules such as flavonoids,
terpenes, and theobromine).
Photo (a) by Paul Bolstad, University of Minnesota Bugwood.org, (b) by Keith Weller,
USDA Agricultural Research Service, Bugwood.org.
finicky plant, unable to tolerate low temperatures or aridity. This restricts it, primarily, to
within 10° latitude of the equator. Chocolate was an important drink in the major central American
Olmec, Maya, and Aztec cultures (see Box 8.4). It became popular in Europe after the Spanish
conquests. Today, beyond its native habitat in South and Central America, T. cacao is important in
the agriculture of Brazil and tropical Africa. West Africa currently produces 70% of the world’s
chocolate, led by the Côte d’Ivoire and Ghana.
In the Maya and Aztec civilizations in Central America chocolate was a medicinal and ceremonial
drink, and an expensive one.1 Fermentation could produce an alcoholic beverage. The Aztecs prepared
chocolate, not as a sweet as we know it, but mixed with chilli pepper to form a spicy, bitter
drink. A Maya vase in the collection of the Princeton University art museum shows a woman pouring
chocolate from a height into a receptacle, to produce froth. Even more analogous to wine, in the
Maya culture it served sacramental, nutritious, and medicinal purposes. The beans also served as
currency: a turkey cost 100 cacao beans.
European contact with cacao beans began in 1502 when Columbus captured the cargo of two Mayan
trading canoes on his fourth voyage. The beans made little impact. Wider appreciation of chocolate
awaited Cortez’s return from his conquest of Mexico. In 1528 Cortez presented to Charles V a sample
of cacao beans, and the tools and recipe for preparing them. In 1544, a group of Maya nobles
accompanied Franciscan friars to the court of Philip II and demonstrated its preparation.
In contrast to the spicy, bitter Mayan and Aztec beverages, the Spanish added sugar. This initiated
the transition of chocolate from a medicinal to a recreational substance. Chocolate beverages
containing alcohol are now prepared by adding spirits to chocolate.
Chocolate was introduced to Europe before coffee or tea were. It was the first exposure of
Europeans to stimulant alkaloids such as caffeine. The taste for chocolate grew and spread, first
as a drink and later in the solid, familiar ‘bar’ form. It is widely believed that the marriage of
Philip III’s daughter Anne of Austria to Louis XIII of France in 1615 brought chocolate across the
Pyrenees. The first chocolate house in London opened in 1657. The popularity of these beverages
derived originally from a belief in their medicinal value, as well as their flavour. Samuel Pepys’s
diary mentions drinking chocolate on several occasions, alluding to its curative properties:
Waked in the morning with my head in a sad taking through the last night’s drink, which I am very
sorry for; so rose and went out with Mr. Creed to drink our morning draft, which he did give me in
chocolate to settle my stomach. (24 April 1661)
The odour and taste of natural chocolate derives from unique flavonoids. Pharmacologically active
compounds in chocolate include:
• theobromine (roughly 1% by weight of dark chocolate): heart stimulant, cough suppressor,
vasodilator, lowering blood pressure leading to feelings of relaxation, diuretic. Pet lovers be
warned: theobromine is toxic to dogs and cats.
• phenethylamine: a stimulant, similar in effect to amphetamines.
The 18th-century physician Sir Hans Sloane was the inventor of milk chocolate. Sloane is better
known as the founding donor of the original collection of the British Museum. London’s chic Sloane
Square was named after him because his heirs owned the land developed. He should not be confused
with the architect Sir John Soane, whose self-designed tomb inspired the classic red British phone
box, now an endangered species.
1 ‘. . . chocolate drinks occupied the same niche as expensive French champagne does in our own
[American] culture.’ S.D. Coe and M.D. Coe, The True History of Chocolate, 2nd ed., Thames and
Hudson, London, 2007, p. 61.
[Figure 8.5: Venn diagram of gene families for Arabidopsis thaliana, Theobroma cacao, Populus trichocarpa, Vitis vinifera, and Glycine max]
Figure 8.5 Numbers of shared and unique gene families and, in parentheses, numbers of genes from chocolate (Theobroma cacao),
thale cress (Arabidopsis thaliana), black cottonwood (Populus trichocarpa), soybean (Glycine max), and wine grape (Vitis vinifera).
From: Argout, X., et al. (2011). The genome of Theobroma cacao. Nat. Genetics 43, 101–108.
• the determinants of resistance to witches’ broom and other diseases
• the sources of the flavours in the chocolate produced
• the place of T. cacao in the general phylogeny of angiosperm plants.
Known disease-resistance genes in plants encode nucleotide-binding site/leucine-rich repeat
(NBS-LRR) proteins and receptor protein kinase (RPK) proteins. The T. cacao criollo genome is
relatively poor in one class of NBS-LRR genes, including those encoding the toll interleukin
receptor (TIR) motif. In T. cacao only 4% of genes orthologous to NBS-LRR contain the TIR motif.
The corresponding number for A. thaliana is 65%! In contrast, the T. cacao criollo genome contains
orthologues of all four NPR1 subfamilies known in A. thaliana.
The molecules that give cocoa its distinctive flavour include flavonoids, alkaloids, and
terpenoids. T. cacao criollo is rich in enzymes producing these compounds. A. thaliana has 36 genes
for flavonoid biosynthesis pathway enzymes; T. cacao has 96. T. cacao criollo also has expanded
families of genes encoding enzymes involved in terpenoid synthesis.
It would be tempting to hypothesize that the paucity of toll interleukin receptor genes might
account for the lower disease resistance of criollo relative to forastero, and the enhancement in
numbers of genes for flavonoid and terpenoid biosynthesis enzymes might account for the enhanced
flavour of criollo relative to forastero. Comparison of the genomes of the two subspecies, which is
so far only in the preliminary stages, suggests that this may be too simplistic an analysis.
Comparing the genomes of several flowering plants allows inference of the primordial angiosperm
genome. The evolution of eudicot genomes is characterized by polyploidization events. Therefore
synteny is an important clue to the phylogeny, in addition to the divergence of individual gene
sequences.
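To make the idea of synteny concrete, the sketch below finds runs of orthologous genes that appear in the same (or exactly reversed) order on two chromosomes. It is a deliberately simplified illustration, not the method used in the cacao genome analysis: it assumes one-to-one orthologues identified by a shared label, ignores genes on the first chromosome that have no orthologue on the second, and requires orthologues to be strictly adjacent on the second. The gene names and the function `syntenic_blocks` are hypothetical.

```python
def syntenic_blocks(chrom_a, chrom_b, min_genes=3):
    """Return runs of genes from chrom_a whose orthologues are collinear on chrom_b."""
    pos_b = {gene: i for i, gene in enumerate(chrom_b)}
    # Anchors pair each shared gene's index on A with its index on B, in A order.
    anchors = [(i, pos_b[g]) for i, g in enumerate(chrom_a) if g in pos_b]

    blocks, current, direction = [], [], 0
    for anchor in anchors:
        if current:
            step = anchor[1] - current[-1][1]
            # Extend the run only if the next orthologue is adjacent on B
            # and the orientation (+1 forward, -1 inverted) stays consistent.
            if step in (1, -1) and direction in (0, step):
                direction = step
                current.append(anchor)
                continue
            if len(current) >= min_genes:
                blocks.append(current)
        current, direction = [anchor], 0
    if len(current) >= min_genes:
        blocks.append(current)
    return [[chrom_a[i] for i, _ in block] for block in blocks]


# Hypothetical gene orders on one chromosome from each of two species.
chrom_a = ["g1", "g2", "g3", "x1", "g7", "g6", "g5", "g9"]
chrom_b = ["g5", "g6", "g7", "y1", "g1", "g2", "g3", "y2", "g9"]

for block in syntenic_blocks(chrom_a, chrom_b):
    print(block)
```

In this toy case the first block is conserved in the same orientation and the second is an inverted segment; g9 is shared but isolated, so it does not form a block.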
[Figure 8.6: panels (a)–(c). Panel (c) labels: eudicot ancestor, n = 7; palaeohexaploid, n = 21; Eurosid I; Eurosid II]
Figure 8.6 Syntenic relationships of eudicot genomes, comparing T. cacao with thale cress (Arabidopsis thaliana), soybean (Glycine
max), black cottonwood (Populus trichocarpa), papaya (Carica papaya), and wine grape (Vitis vinifera), and inferring the chromosome
structure of the common ancestor. (a) Mapping of orthologues from T. cacao to the other five genomes. For each species, chromosomes
are numbered consecutively. Different colours represent seven ancestral eudicot linkage groups. (b) Syntenic relationships within the
T. cacao genome. There is evidence for ancestral triplication of the genome. For example, light green links regions of chromosomes 1,
2, and 8. (c) A model for the evolutionary history of the karyotype. The six species studied derive from an original eudicot ancestor
with 7 chromosomes. To produce the current karyotypes there were whole-genome duplications (R) and chromosome fusions (F).
From comparisons of the chromosomes of the six species, it is possible to trace the path from the current chromosome structure of
the six species (shown at the bottom), back to the common ancestor. The gene distribution among T. cacao chromosomes is the
closest, of these six species, to the ancestral form.
From: Argout, X., et al. (2011). The genome of Theobroma cacao. Nat. Genetics 43, 101–108.
Figure 8.6 shows the syntenic relationships of six eudicot species. The results suggest that the ancestral eudicot contained seven chromosomes. Interestingly, of these six species, T. cacao appears to be closest to the ancestral form.
Genomics in anthropology
Our genomes contain the history of the origins and development of our species. We have already seen how genomics can elucidate phylogenetic relationships, and how palaeosequencing of extinct species can widen the coverage of these studies. The palaeosequence that has most captured popular interest has been that of one of our closest relatives: the Neanderthal genome.

The Neanderthal genome

In 1856, about a dozen bones were found in a limestone quarry in the Neander Valley near Düsseldorf (Valley = Thal in German). Although, in retrospect, Neanderthal bones had appeared earlier, the 1856 discovery was the first to be recognized as remains from a new species.

Neanderthals were hominids closely related to modern humans. The earliest fossils in which they appear are about 130 000 years old. They were approximately the same size as humans, but substantially more robust. Their cranial capacity was as large as that of modern humans or perhaps slightly larger, although this in itself does not guarantee comparable cognitive abilities. Neanderthals did fashion and use tools, and produce at least decorative arts. It has been suggested that they wore makeup. They engaged in complex funerary practices.

Both modern humans and Neanderthals inhabited Europe and Central Asia, until the Neanderthals vanished, about 30 000 years ago. Articles in scientific journals inquire about their similarities to and differences from modern humans, and why they became extinct. The popular press focuses on the possibility of sexual congress between Neanderthals and humans.

The sequencing of Neanderthal mitochondrial DNA was completed in 2008. For the nuclear genome, DNA extracted from bones from three individuals from a site in Croatia produced a total of 4 Gb of Neanderthal sequence.

A three-way comparison among Neanderthal, human, and chimpanzee sequences allows analysis of human–Neanderthal divergence. There are 78 sites in protein-coding genes containing amino acids specific to humans, that is, the amino acid at that position is the same in chimpanzee and Neanderthal, and different in human. In contrast, one result of interest is that Neanderthals have the human version of the FOXP2 gene, involved in language skills in humans.

The data suggest that the populations ancestral to Neanderthals and humans split about 270 000–400 000 years ago. Admittedly, this is a very rough estimate. But it antedates the migration of humans from Africa. It suggests the following scenario: the populations split in Africa 370 000 years ago. One population migrated to Europe to give rise to the Neanderthals. The other remained in Africa, and gave rise to humans. Then humans migrated to Europe about 40 000 years ago, where they coexisted with Neanderthals for about 6000 years.

Did humans interbreed with Neanderthals? The question is quite controversial. It is argued that, judging by human variation patterns in the HapMap data, Neanderthals are more closely related to humans from regions other than Africa than to African populations. If there were no genetic admixture, one would expect no difference: Neanderthals should be equally closely related to all human populations. The scenario implied is that on their way out of Africa, humans interbred with Neanderthals and carried the genes thereby picked up around the rest of the world. The estimate is that 1–4% of the human genome is Neanderthal in origin. One might reasonably suppose that Neanderthals should be most closely related to European human populations – for that is where they coexisted for longest – but the data do not support this.

• It has been possible to sequence DNA from bones of Neanderthals, a species of hominid that has been extinct for about 30 000 years. Humans and Neanderthals both inhabited Europe for about 6000 years. One question addressed using the Neanderthal sequence is whether there was human–Neanderthal interbreeding. It has been estimated that 1–4% of the human genome is Neanderthal derived. However, this conclusion is controversial.

Ancient populations and migrations

When people move around, they take their DNA with them. This makes it possible to trace patterns of migration. Many studies focus on mitochondrial DNA sequences, which reveal lines of maternal inheritance (see Box 8.5). Y chromosomes provide complementary paternal information. Although studies of Y sequences are more sparse, in many cases they corroborate the implications of the mitochondrial data.

There is now a consensus that our species, Homo sapiens, arose in Africa approximately 100 000–150 000 years ago. Migrations beginning approximately 60 000 years ago took our ancestors around the world, and continue to do so. For ancient migrations, unlike modern population flows documented in historical records, we depend on archaeological relics, modern genomics, and linguistics to infer the timing, the routes, the numbers of individuals, and even perhaps the motivation.

Other crucial transitions in human social organization, such as turning from hunting to agriculture, are reflected to some extent in domestications of other species such as maize and dog.
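The three-way comparison described in the Neanderthal section (sites where chimpanzee and Neanderthal agree but human differs) can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `human_specific_sites` and the short aligned fragments are invented for illustration and are not real gene sequences, and a real analysis would also have to deal with alignment quality and ancient-DNA damage.

```python
def human_specific_sites(human, neanderthal, chimp):
    """Return 1-based alignment positions where the human residue is unique.

    A site counts as human-specific when chimpanzee and Neanderthal carry the
    same residue and the human residue differs from it.
    """
    if not (len(human) == len(neanderthal) == len(chimp)):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for position, (h, n, c) in enumerate(zip(human, neanderthal, chimp), start=1):
        if "-" in (h, n, c):
            continue  # skip alignment gaps; no comparison is possible there
        if n == c and h != n:
            sites.append(position)
    return sites


# Hypothetical aligned protein fragments (invented for illustration only).
human = "MKTAYLQR"
neanderthal = "MKSAYLQK"
chimp = "MKSAYIQK"

print(human_specific_sites(human, neanderthal, chimp))  # [3, 8]
```

Position 6 is not reported: there human and Neanderthal agree and chimpanzee differs, which is the pattern expected for a change that predates the human–Neanderthal split.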
Box 8.5

Human mitochondrial DNA is a double-stranded, closed, circular molecule 16 569 bp long. It is inherited almost exclusively through maternal lines. A fertilized egg contains the mother’s mitochondria. Although sperm contain mitochondria – essential to provide energy for their motility – the few paternal mitochondria that enter the egg are selectively eliminated. As a haploid entity, mitochondrial DNA is, therefore, not subject to recombination, and changes only by mutation.

Mitochondrial DNA is estimated to acquire one mutation every 25 000 years. This gives a reasonable rate of divergence to trace human migration patterns. (Nuclear DNA mutates approximately ten times more slowly than mitochondrial DNA because (1) histones protect it; (2) active repair mechanisms edit out some mutations; and (3) the activity of mitochondria in oxidative phosphorylation exposes mitochondrial DNA to mutagenic oxygen radicals.)

Human mitochondrial DNA contains genes for 22 tRNAs, two ribosomal RNAs, and 13 proteins. The major non-coding region is the control region, or D-loop, involved in regulation and initiation of replication. This region is about 1 kb long. It shows a higher rate of substitution than the rest of the mitochondrial genome, by a factor of about four.

Different mitochondrial DNA sequences are associated with different populations. Mutations are referred to the first human mitochondrial DNA sequence determined, called the Cambridge Reference Sequence. Groups of related sequences are called haplogroups. (The distribution of the number of sequence differences between different individuals has a peak at ∼70 for Africans and ∼30 for non-Africans.) The original classification of sequence variants depended on changes in restriction sites (see Figure 8.7). This was followed by explicit sequencing of the control region, focusing on its two highly polymorphic segments. For finest resolution, contemporary studies are now more frequently determining full mitochondrial DNA sequences, except in cases of ancient DNA where the best recoverable material may be fragmentary.

Several databases focus on human mitochondrial genomes, including MITOMAP (http://www.mitomap.org) and mtDB (http://www.genpat.uu.se/mtDB).
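Reporting mutations relative to a reference sequence, as described above for the Cambridge Reference Sequence, amounts to listing the positions at which a sample differs from the reference. The following is a minimal sketch under stated assumptions: the two short sequences and the starting coordinate 101 are invented for illustration and are not taken from the Cambridge Reference Sequence, the function names are hypothetical, and only substitutions (no insertions or deletions) are handled.

```python
def variants_vs_reference(reference, sample, start=1):
    """List (position, reference_base, sample_base) for every substitution.

    `start` is the coordinate of the first base of the aligned segment.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (start + offset, ref_base, sample_base)
        for offset, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


def format_variant(position, ref_base, sample_base):
    """Write a substitution in the compact form ref-position-new, e.g. T103C."""
    return f"{ref_base}{position}{sample_base}"


# Hypothetical segment, numbered from an arbitrary position 101.
reference_segment = "CTTCACCG"
sample_segment = "CTCCACTG"

variants = variants_vs_reference(reference_segment, sample_segment, start=101)
print([format_variant(*v) for v in variants])  # ['T103C', 'C107T']
```

A list of this kind is the raw material for haplogroup assignment: sequences sharing characteristic sets of differences from the reference are grouped together.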
[Figure 8.7 tree; graphic not reproduced. Internal nodes shown: ‘Eve’, L3, M, CZ, N, R, HV, HV0. Tip labels, in the order drawn:]

Haplogroup   Prevalent in
L1           Africa
L2           Africa
L3+          Africa
Z            Siberia
C            NE Asia, Amerinds
D            NE Asia, Amerinds
M*           India, Asia
E            (no region given)
G            East Asia
Q            Oceania
A            Asia, Americas
Z            Russia, Saami
W            W. Urals, E. Baltic
X            Near East, Caucasus
Y            Siberia
N*           Asia, Oceania
B            Amerinds, SE Asia
F            Asia
H1           Europe
H3           Europe
V            Oceania
R*           East Asia
P            Oceania
T            Eastern Baltic Sea, Urals
J            Europe
U            Europe
K            Europe
Figure 8.7 Phylogenetic tree of major mitochondrial haplogroups. The nomenclature began with a study of Native Americans, or
Amerinds, and the letters A, B, C, and D were assigned to them. Other letters were introduced and were subdivided as needed as
more detailed sequencing data appeared. HV0 was formerly called pre-V.
P. Forster has created a ‘movie’ showing successive stages of human dispersal (see Figure 8.8).

The evidence for human origins in Africa is that contemporary genetic diversity is highest there. The mitochondrial DNA haplogroup L1, believed to be the oldest haplotype that survives, is found in the KhoiSan of the Kalahari Desert in southern Africa and in the Biaka pygmies of the central African rainforest (see Figure 8.8a). An expansion of L2 and L3 haplogroups took place within Africa about 80 000–60 000 years ago (see Figure 8.8b). Over two-thirds of contemporary Africans belong to these groups.

The first emigration from Africa occurred about 85 000–55 000 years ago. The participants in this first dispersal carried the L3 mitochondrial DNA haplogroup, which arose ∼84 000 years ago. Mutations in L3 gave rise to haplogroups M and N, ∼60 000 years ago (see Figure 8.8c). Haplogroup M mitochondrial DNA appears in ancient populations from the Andaman Islands, southern continental India, and the Malaysian Peninsula. This suggests a dispersal via what is now southern Iran along the coast to India, the Malaysian Peninsula, and Australia. Archaeological evidence shows that humans reached Australia by 46 000 years ago. Dating of the earliest human remains in Australia is consistent with the extinction of many large mammals and birds shortly thereafter (see Figure 8.8d).

From 60 000 to 30 000 years ago, human populations expanded in southern Europe and Asia, accumulating mutations in mitochondrial DNA to form new haplogroups (see Figure 8.8e).

Expansion into Europe was delayed and interrupted. Between 20 000 and 30 000 years ago – an interglacial era – the first human Europeans encountered and replaced the Neanderthal population (see Figure 8.8f). The same climate conditions permitted a northwards expansion in Asia.

The closing of the Bering Strait allowed humans from northwest Asia to move across to North
[Figure 8.8, panels (a) and (b); maps not reproduced. Recoverable labels: haplogroups L0, L1, L2, L3; Neanderthals; Homo erectus; Skhul/Qafzeh, Israel; Omo valley, Ethiopia.]
Figure 8.8 ‘Movie’ of human migration patterns, based on mitochondrial DNA sequences. Letters indicate haplotypes.
(a) The beginnings, ∼150 000 years ago. (b) Expansion and divergence within Africa, 80 000–60 000 years ago. (c) Out of
Africa, 60 000–50 000 years ago. (d) Spread through the south coast of the Indian Ocean, reaching Australia, 50 000–30 000 years
ago. (e) Expansion in the eastern Mediterranean, India, southeast Asia and Australia, and central Asia, 30 000 years ago. (f) After
30 000 years ago, milder climate conditions permitted expansion northwards in Europe and Asia. (g) Crossing the Bering Strait,
first human inhabitants of American continents, 20 000–15 000 years ago. (h) Ice Age, 20 000 years ago, with humans forced
to retreat south. (i) Spread to South America, 18 000 years ago. (j) Warmer climate, with resettlement of northern latitudes,
15 000–13 000 years ago. (k) Sudden warming and subsequent stable climate, 11 400 years ago, allowing the spread of agriculture.
(l) Expansion to islands, 2000 years ago. (m) The current picture.
From: www.rootsforreal.com. See also: Forster, P. (2004). Ice ages and the mitochondrial DNA chronology of human dispersals: a review. Phil. Trans. R.
Soc. B: Biol. Sci. 359, 255–264 (Figure 5).
[Figure 8.8, panels (c)–(m); maps with haplogroup letters not reproduced. Recoverable site labels: Niah Cave, Borneo; Lake Mungo, Australia; Yana river, Siberia; Masterov Kliuch, Siberia; Gravettian, Europe; Zhoukoudian, China; Meadowcroft, Pennsylvania; Mesa, Alaska; Studence, Siberia; Magdalenian, Europe; Clovis, New Mexico; Monte Allegre, Brazil; Monte Verde, Chile.]
America and to expand southwards (see Figure 8.8g). Human remains found in Alaska have been dated to 9800–9200 years ago. Current evidence suggests that humans first arrived in America 20 000–15 000 years ago.

When glaciers covered northern Europe and arctic America again, humans retreated southwards (see Figure 8.8h and i). Only isolated pockets of the original settlers remained in Europe. One such pocket was ancestral to the Basques, with their now-unique H and V mitochondrial haplogroups. A subsequent warm period, starting 15 000 years ago, saw resettlement of northern Europe (see Figure 8.8j). The genetic diversity in northern Europe is accordingly reduced. The ending of the last ice age, quite abruptly 11 400 years ago, permitted the expansion of agriculture (see Figure 8.8k, and Box 8.6).

• Agriculture reached Britain about 5000 years ago.

Late migrations, during the last 2000 years, populated islands such as Greenland by Inuit from Alaska, Madagascar by people from southeast Asia, and Pacific islands by Austronesians (see Figure 8.8l).

The contemporary distribution of mitochondrial haplogroups contains the records of this history (see Figure 8.8m).
Box 8.6

We have discussed crop domestication from the plants’ point of view. Let us now examine the human consequences. There is a consensus that agriculture began in the Middle East about 12 000 years ago. After the glaciers receded, farming spread northwards through Europe, supporting a population increasing in size.

The nature of the process has been the subject of debate – a debate that has not become noticeably less heated as more data have been measured. Opinions span a spectrum between the extremes of a movement of people – farmers and their descendants who moved north and east – and dissemination of culture – adoption of farming by remnants of Palaeolithic Europeans and their descendants.

The spread of Indo-European languages along the same northeast vector, correlated with the movement of agriculture, is ambiguous – like agriculture, languages can travel either by migration of people or by cultural transmission. However, surveys of genomes of contemporary and ancient Europeans should be able to detect the relative contributions of Neolithic Middle Eastern farmers and Palaeolithic Europeans.

The contributions to contemporary European DNA sequences have been apportioned to several populations including the Basques – representing the original Palaeolithic European inhabitants – and Middle Easterners – representing the originators of agriculture.* The study included markers on autosomes and the Y chromosome. The mean admixture suggested an approximately equal contribution from the two populations. There is a geographic gradient. The Middle-Eastern contribution decreases as the distance from the Middle East increases.

The study by Dupanloup and co-workers used contemporary populations as representatives of ancient ones. Another study examined ancient DNA directly.† Mitochondrial DNA was sequenced from remains at 7500-year-old Neolithic sites linked by cultural artefacts to the initial spread of farming. These mitochondrial DNA results suggest that these early farmers did not contribute a large fraction to the European mitochondrial gene pool. Sequences from the Y chromosome confirm this.

The conclusion is that early farmers did migrate to Europe from the Near East.

* Dupanloup, I., Bertorelle, G., Chikhi, L., & Barbujani, G. (2004). Estimating the impact of prehistoric admixture on the genome of Europeans. Mol. Biol. Evol. 21, 1361–1372.
† Haak, W., Balanovsky, O., Sanchez, J.J., et al. (2010). Ancient DNA from early European neolithic farmers reveals their near eastern affinities. PLoS Biol. 8, e1000536.
Genomics and language

Language, as a biological phenomenon, links genomics with several other disciplines, including neurobiology, development, medicine, and anthropology. Language is also a pre-eminent social phenomenon. It underlies interpersonal communication (both face-to-face and worldwide), commerce, and literature. Many important social decisions (e.g. the design of educational systems) depend on appreciating the biological substrates of language. These biological substrates are subtle and our knowledge of them is far from complete.

Language is a feature of all human populations. Of the approximately 6500 languages currently spoken, a few major ones such as Chinese, English, Hindi, and Spanish can each claim over 400 million speakers (see Table 8.1). Others have much smaller communities: Europe retains niche languages such as Basque and Breton. There were estimated to be about 250 languages of indigenous Australians at the time of settlement by Europeans, of which about two-thirds survive. There are over 800 distinct languages spoken on the island of New Guinea.

Table 8.1 Major languages of the world

Native language      Number of speakers
Mandarin Chinese     >1 000 000 000
English              500 000 000
Hindi                495 000 000
Spanish              425 000 000

Human spoken languages have many features in common with biological species. Both languages and species exhibit varying degrees of similarity and diversity. Languages and species can be classified into taxonomies that, at the highest levels at least, are hierarchical. In both languages and species, ancestor–descendant relationships can be observed. Languages diverge, as species do. The Romance languages, divergent descendants of Latin, are the Galapagos finches of linguistics. Latin itself sits within the larger class of Indo-European languages, of which Sanskrit, the classical language of India, is the extant language putatively closest to their common ancestor. Dialects of languages are analogous to genetic haplotypes.

Just as species have become extinct, so have many languages. Many languages are currently ‘endangered’, with only a few elderly speakers remaining. Fortunately, like species, languages can be rescued from extinction. Navajo, in decline until the 1940s, is now thriving. Hebrew, which had survived as a written language, re-emerged in the twentieth century as the spoken language of the new country of Israel. In some cases, it is possible to reconstruct an extinct common ancestor of surviving descendant languages. This can have interesting implications about the lifestyles of its speakers. For instance, the similarities in some words in Indo-European languages, and dissimilarities in others, suggest that the speakers of the ancestral Indo-European language had domesticated dogs and horses but not cats and camels.

Linguists have developed quantitative methods to measure divergence of languages. Languages develop variation in pronunciation (or spelling, if the language is written) and in structure. An example of variation in pronunciation would be the difference between the English word ship and the German cognate Schiff. An example of a structural feature of a language is word order: in sentences in English and many other languages, the typical word order is subject–verb–object, as in ‘The mouse ate the cheese’. In some languages, such as Turkish, typical word order is subject–object–verb. Phonological change is much more rapid than structural change – molecular biologists might find a useful analogy to the rapidity of sequence change relative to structural change in proteins!

There are obvious correlations between people’s native language and physical features that are genetically determined; for instance, blue or brown eyes correlate well with native speakers of Swedish or Mandarin, respectively. However, language is clearly transmitted culturally rather than genetically: any child can become a fluent speaker of any language to which he or she is exposed in the cradle. (There is, however, a finite window: it is much more difficult after puberty to become fluent in a new spoken language; some aspect of neural plasticity is shut down. As a result, language has served, and continues to serve, as a litmus test for immigrants. This is, unfortunately, frequently used as a criterion for social discrimination, by persons who may even believe sincerely that there is a correlation between imperfect speech in a non-cradle language and inherent or genetic inferiority.)

Although there is no direct genetic link with native language, L.L. Cavalli-Sforza and co-workers found substantial similarities between clustering of genetic variation among human populations and language groups. For instance, just as the Basque language is an isolate, unrelated to any other language spoken in Europe, the Basque people are also genetically distinct. This link between languages and genetics is useful because we can estimate rates of genetic change and thereby infer when languages arose and diverged. One conclusion is that most of the world’s language families arose, in a burst, between 6000 and 25 000 years ago. When such inferences from genetics can be compared with datable archaeological evidence or historical records, the results are sometimes, although not always, gratifyingly consistent.

Comparison of genetic and linguistic data can illuminate what happens when two populations collide. Some discrepancies between correlations of genetics and languages appear when conquerors impose a new language on an indigenous population.

• This happened in (what is now) Great Britain after the Anglo-Saxon invasions around the 5th century. The indigenous Celtic languages were pushed to far reaches of the British Isles (and Brittany; Breton is a Celtic language). Less effective, linguistically, was the Norman Conquest of 1066, after which English absorbed many French words but retained its own vocabulary alongside them.
• Hungary was converted to speaking a Uralic language upon invasion by the Magyars in 896 ad.
• In contrast, the fall of Rome in 476 ad was not followed by a replacement of Latin by a Germanic tongue.

In Ivanhoe, Sir Walter Scott contrasted the French word veau for the meat, with calf for the animal, noting that ‘. . . he is Saxon when he requires tendance, and takes a Norman name when he becomes matter of enjoyment’.

Conversely, although native language is not determined by genes, language differences can retard genetic mixing among populations by creating cultural barriers to gene transfer.

Linguistics will continue to be an area in which the very subtle links between genomics and culture can be explored.
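The text mentions quantitative measures of language divergence and gives the cognate pair ship and Schiff. One very simple way to put a number on the difference between two spellings is the edit (Levenshtein) distance familiar from sequence comparison. The sketch below is an analogy only: it is not claimed to be the method linguists actually use, it works on spelling rather than pronunciation, and the function name is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-letter insertions, deletions, or substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


english, german = "ship", "schiff"
distance = edit_distance(english, german)
print(distance)                                   # 3
print(distance / max(len(english), len(german)))  # 0.5
```

The three edits are the insertion of c, the substitution of p by f, and the insertion of a second f. Dividing by the length of the longer word gives a rough normalized divergence, in the same spirit as a per-site difference between two aligned sequences.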
● RECOMMENDED READING
• A historical review, and a recent article, on the use of DNA sequences in personal identification:
Jeffreys, A.J. (2003). Genetic fingerprinting. In Changing Science and Society. T. Krude (ed.)
Cambridge University Press, Cambridge, pp. 44–67.
Giardina, E., Spinella, A., & Novelli, G. (2011). Past, present and future of forensic DNA typing.
Nanomedicine 6, 257–270.
• Discussions of domestication of crop plants:
Murphy, D.J. (2007). People, Plants, and Genes. The Story of Crops and Humanity. Oxford
University Press, Oxford.
Doebley, J.F., Gaut, B.S., & Smith, B.D. (2006). The molecular genetics of crop domestication.
Cell 127, 1309–1321.
Tang, H., Sezen, U., & Paterson, A.H. (2010). Domestication and plant genomes. Curr. Opin.
Plant Biol. 13, 160–166.
Zeder, M.A. (2006). Central questions in the domestication of plants and animals. Evol.
Anthropol. 15, 105–117.
Doebley, J. (2006). Plant science. Unfallen grains: how ancient farmers turned weeds into crops.
Science 312, 1318–1319.
Staller, J., Tykot, R.H., & Benz, B. (2006). Histories of Maize: Multidisciplinary Approaches to
the Prehistory, Linguistics, Biogeography, Domestication, and Evolution of Maize. Elsevier,
London.
• The contribution of genomics to understanding the history of agriculture:
Armelagos, G.J. & Harper, K.J. (2005). Genomics at the origins of agriculture. Evol. Anthropol.
14, 68–77, 109–121.
• A discussion of the ‘green revolution’ and extrapolating from contemporary agricultural statistics
to the needs of the future and how they might be met:
Davies, W.P. (2003). An historical perspective from the Green Revolution to the gene revolution.
Nutr. Rev. 61, S124–S134.
• The Neanderthal genome and relation to the human:
Noonan, J.P. (2010). Neanderthal genomics and the evolution of modern humans. Genome Res.
20, 547–553.
Exercises, problems, and weblems
Exercises
Exercise 8.1 On a photocopy of Figure 8.1, (a) sketch in and indicate by the letter X a band, the
appearance of which would prove that F (F = the person whose DNA produced lane F) was not
the father of child C, but individual M could be the mother. (b) Sketch in and indicate by the letter
Y a band, the appearance of which would prove that M was not the mother of child C, but F
could be the father. (c) Sketch in and indicate by the letter Z a single band, the appearance of
which would prove that M was not the mother of child C and F was not the father.
Exercise 8.2 On a photocopy of Figure 8.1, relabel lane M as C, lane F as M, and lane C as F.
Now M sues F for support of child C, alleging paternity. Can F prove from these data that he is
not the father of C?
Exercise 8.3 Why could DNA analysis not decide whether or not a woman in the UK was named
Sykes?
Exercise 8.4 Chocolate (Theobroma cacao) is not shown on the map in Figure 8.2. Insert T. cacao
on a photocopy of this figure.
Exercise 8.5 One of the papers on the recently sequenced Theobroma cacao genome states
that the Belizean criollo genotype ‘. . . is suitable for a high-quality genome sequence assembly
because it is highly homozygous as a result of the many generations of self-fertilization that
occurred during the domestication process’. Why do these characteristics make this variety suitable
for a high-quality genome sequence assembly?
Exercise 8.6 (a) How many genes in Theobroma cacao do not have homologues appearing in
Arabidopsis thaliana, Populus trichocarpa, Glycine max, and Vitis vinifera? (b) How many
gene families are shared by all four species? (c) How many gene families appear in A. thaliana,
P. trichocarpa, Glycine max, and V. vinifera, but not in T. cacao?
Exercise 8.7 How many syntenic groups in the T. cacao genome show evidence for triplication
of the genome? List the triplets of chromosomes linked by each group.
Problems
Problem 8.1 Jeffreys has recently introduced a newer identification method based on one VNTR
locus, D1S8 (or MS32) (see Figure 8.9). This locus contains hundreds of repeats of a 19 bp
sequence. One base in the repeat is hypervariable, with the result that each repeat may contain,
or lack, a HaeIII restriction enzyme cut site. The feature identifying any individual’s DNA is a binary
code specifying the sequence of presence or absence of HaeIII restriction sites in successive repeats
within the locus. (The sites on homologous chromosomes vary independently, multiplying the
variability.)
The identification is read by designing PCR primers: one to a common sequence outside the
repeat, and two others to each of the alternative repeat sequences. Amplification using the
common primer and one of the others produces a series of fragments: each fragment spans
the sequence complementary to the common primer, outside the repeat, to each repeat
complementary to the other primer. These fragments can be separated by size on a gel and
read off as a bar code. A similar, separate amplification using the other primer can detect
fragments that contain the other possible sequence.
Sketch the appearance of a gel containing two lanes, one from each of the amplifications shown
in Figure 8.9.
Figure 8.9 Variable repeat of regions containing either of two alternative sequences, one of which
contains a cutting site for restriction enzyme HaeIII. The order of the choices between the alternative repeat
sequences provides a ‘binary code’. In this diagram, blue represents one repeat sequence and red the other.
Top: one PCR amplification is based on primers to one of the alternative sequences (blue arrow) and to a
common region flanking the repeat (magenta arrow), which produces fragments starting at each of the
occurrences of one of the alternative repeat sequences. The fragments produced are indicated below the
arrows. Bottom: a second PCR amplification based on a primer to the other repeat sequence (red arrow)
and the common flanking region (magenta arrow) produces fragments starting at each of the occurrences
of the other repeat sequence. The fragments produced are indicated below the arrows. (This is a somewhat
simplified description of the actual technique.)
Problem 8.2 George Beadle crossed a strain of maize with teosinte and examined the second
generation (F2). Approximately 1/500 plants was identical to the maize grandparent and
approximately 1/500 was identical to the teosinte grandparent. Assuming Mendelian inheritance
of n unlinked genes, such that the teosinte grandparent was homozygous for n teosinte alleles and
the maize grandparent was homozygous for n maize alleles, estimate the value of n such that the
fraction of F2 plants genetically identical to the grandparents at all n loci is approximately 1/500.
Problem 8.3 Numerous genera of ants cultivate fungi for food. Suggest some similarities
and some differences between the domestication of fungi by ants and the domestication of
agricultural crops by humans.
Problem 8.4 If humans first entered Alaska from Asia 20 000 years ago and archaeological
remains are found in Monte Verde, Chile, dated 14 000 years ago, what mean annual rate of
southward migration is implied? What two villages or cities in the vicinity of where you
live are as far apart as this rate would imply for 10 years of migration?
Problem 8.5 What accounts for the paucity of disease-resistant genes in T. cacao? It has been
suggested that the genes encoding toll interleukin receptor motif proteins are older than the
angiosperm–gymnosperm split, and that they have been lost in lineages including genes in
T. cacao. (a) What alternative hypothesis is suggested by the observation that T. cacao is close
to the ancestral eudicot? (b) How might these hypotheses be tested?
Weblems
Weblem 8.1 What is the difference in structure between theobromine and caffeine?
Weblem 8.2 (a) Are there any other species in the genus Theobroma? (b) What other genus of
plant is most closely related to Theobroma?
Weblem 8.3 A region of human mitochondrial DNA sequence, positions 16 047–16 385, was
determined to have the following mutations, relative to the Cambridge Reference Sequence (CRS):
Position (nt)
CRS C T T C C C C
Unknown A C C T T T T
Microarrays and
Transcriptomics
LEARNING GOALS
Introduction
Microarrays provide the link between the static genome and the dynamic proteome. We use microarrays: (1) to analyse the mRNAs in a cell, to reveal the expression patterns of proteins; and (2) to detect genomic DNA sequences, to reveal absent or mutated genes.
For an integrated characterization of cellular activity, we want to determine what proteins are present, where and in what amounts. We infer protein expression patterns from measurements of the relative amounts of the corresponding mRNAs. Hybridization is an accurate and sensitive way to detect whether any particular nucleic acid sequence is present. Microarrays achieve high-throughput analysis by running many hybridization experiments in parallel (see Box 9.1).
• The transcriptome of a cell is the set of RNA molecules it contains; the proteome is its proteins.
Expression patterns can also help to identify genes that underlie diseases. Some diseases, such as cystic fibrosis, arise from mutations in single genes. For these, isolating a region by genetic mapping can help to pinpoint the lesion. Other diseases, such as asthma, depend on interactions among many genes, with
Compare the following types of measurement:
• ‘One-to-one’. To detect whether one oligonucleotide has a particular known sequence, test whether it can hybridize to the oligonucleotide with the complementary sequence.
• ‘Many-to-one’. To detect the presence or absence of a query oligonucleotide in a mixture, spread the mixture out and test each component of the mixture for binding to the oligonucleotide complementary to the query. This is a northern or Southern blot.
• ‘Many-to-many’. To detect the presence or absence of many oligonucleotides in a mixture, synthesize a set of oligonucleotides, one complementary to each sequence of the query list, and test each component of the mixture for binding to each member of the set of complementary oligonucleotides. Microarrays provide an efficient, high-throughput way of carrying out these tests in parallel.
To achieve parallel hybridization analysis a large number of DNA oligomers is affixed to known locations on a rigid support, in a regular two-dimensional array. The mixture to be analysed is prepared with fluorescent tags to permit the detection of the hybrids. The array is exposed to the mixture. Some components of the mixture bind to some elements of the array. These elements now show the fluorescent tags. Because we know the sequence of the oligomeric probe in each spot in the array, measurement of the positions of the hybridized probes identifies their sequences. This identifies the components present in the sample (see Figure 9.1).
Such a DNA microarray is based on a small wafer of glass or nylon, typically 2 cm². Oligonucleotides are attached to the chip in a square array, at densities between 10 000 and 250 000 positions per cm². The spot size may be as small as ∼150 μm in diameter. The grid is typically a few centimetres across. A yeast chip contains over 6000 oligonucleotides, covering all known genes of Saccharomyces cerevisiae.
A DNA array, or DNA chip, may contain 400 000 probe oligomers. Note that this is larger than the total number of genes, even in higher organisms (excluding immunoglobulin genes). However, the technique requires duplicates and controls, reducing the number of different genes that can be studied simultaneously. Nevertheless, it is possible to buy a single chip containing all known human genes (not all immunoglobulin genes, of course). Also available is a set of ‘tiling’ chips that cover the entire human genome sequence.
A mixture is analysed by exposing it to the microarray under conditions that promote hybridization, then washing away any unbound oligonucleotides. To compare material from different sources, the samples are tagged with differently coloured fluorophores. Scanning the array collects the data in computer-readable form.
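The ‘many-to-many’ idea can be illustrated with a minimal sketch. It is a toy model only: hybridization is reduced to an exact reverse-complement match, and the probe positions, sequences, and function names are invented for illustration. Real hybridization tolerates mismatches and depends on experimental conditions.

```python
# Toy model of a 'many-to-many' measurement: each array position holds one
# known probe; a target "binds" only if it is the exact reverse complement.
# All sequences and positions below are made up for illustration.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hybridize(probes: dict, targets: list) -> dict:
    """Map each array position to the targets that bind its probe."""
    bound = {}
    for position, probe in probes.items():
        hits = [t for t in targets if reverse_complement(t) == probe]
        if hits:
            bound[position] = hits
    return bound


# Known probe sequence at each (row, column) of the array.
probes = {(0, 0): "ACGTTGCA", (0, 1): "GGATCCAA", (1, 0): "TTTTCCCC"}
# Unknown mixture to be analysed.
targets = ["TTGGATCC", "GGGGAAAA", "ACACACAC"]

spots = hybridize(probes, targets)
for position, hits in spots.items():
    print(position, "lights up; probe", probes[position], "bound", hits)
```

Because the probe at each position is known, the set of positions that light up is enough to identify which sequences were present in the mixture.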
[Figure 9.1 diagram labels: Sample, Control; Isolate mRNA; cDNA synthesis (control and sample tagged with different fluorescent dyes); Hybridization, washing; Measure fluorescence pattern]
Figure 9.1 Schematic diagram of a microarray experiment. A sample to be tested is compared with a control of known properties. From each source, mRNA is isolated and converted to cDNA, using reagents bearing a fluorescent tag, with different colours for the control and sample. After hybridizing to the microarrays and washing away unbound material, the bound target oligonucleotides appear at specific positions. A red spot indicates binding of oligonucleotides from the sample. A green spot indicates binding of oligonucleotides from the control. A yellow spot indicates binding of both. Each probe, represented here by a wavy black line affixed to the support, really contains many copies of a single oligonucleotide. Indeed, for accurate measurement, the concentration of the target must greatly exceed the concentration of the probe. If both red- and green-tagged targets are complementary to the oligonucleotide probe at one spot, both can bind to different probe molecules within the same spot.
The probe sequences, fixed on the chip, are large pieces of genomic DNA from known chromosomal locations, typically 500–5000 bp long. The target mixtures contain genomic DNA from normal or disease states. For instance, some types of cancer arise from chromosome deletions, which can be identified by microarrays.
• Mutation or polymorphism microarray analysis is the search for patterns of single-nucleotide polymorphisms (SNPs). The oligonucleotides on the chip are selected from reference genomic data. They correspond to many known variants of individual genes.
• Protein microarrays are arrays of protein detectors – usually antibodies – that detect protein–protein interactions.
• Tissue microarrays collect and assemble microscopic samples of tissue. They permit comparative analysis of the molecular biology and immunohistochemistry of the samples.
• DNA microarrays analyse the RNAs of a cell, to reveal expression patterns of proteins and of non-protein-coding RNAs; or genomic DNAs, to reveal absent or mutant genes. One of the reasons that microarrays are such a versatile technology is that there are many different kinds of chips. Commercially available chips can target all or at least most known genes for individual species, for example, or even tile an entire genome.
Microarray data are semiquantitative
Microarrays are capable of comparing concentrations of target oligonucleotides. This allows investigation of responses to changed conditions. Unfortunately, the precision is low. Moreover, mRNA levels, detected by the array, do not always quantitatively reflect protein levels. Indeed, usually mRNAs are reverse transcribed into more stable cDNAs for microarray analysis; the yields in this step may also be non-uniform. Microarray data are, therefore, semiquantitative: although the distinction between presence and absence is possible, determination of relative levels of expression in a controlled experiment is more difficult, and measurement of absolute expression levels is beyond the capability of current microarray techniques. A change in expression levels of a gene between two samples by a factor ≥1.5–2 is generally considered a significant difference.
Applications of DNA microarrays
• Investigating cellular states and processes. Profiles of gene expression that change with cellular state or growth conditions can give clues to the mechanism of sporulation, or to the change from aerobic to anaerobic metabolism. In higher organisms, variations in expression patterns among different tissues, or different physiological or developmental states, illuminate the underlying biological processes.
• Comparison of related species. The very great similarity in genome sequence between humans and chimpanzees suggested that the profound phenotypic differences must arise at the level of regulation and patterns of protein and RNA expression, rather than in the few differences between the amino acid sequences of the proteins themselves. Microarrays are the appropriate technique for following up this idea.
• Diagnosis of genetic disease. Testing for the presence of mutations can confirm the diagnosis of a suspected genetic disease. Detection of carriers can help in counselling prospective parents.
• Genetic warning signs. Some diseases are not determined entirely and irrevocably by genotype, but the probability of their development is correlated with genes or their expression patterns. Microarray profiling can warn of enhanced risk.
• Precise diagnosis of disease. Different related types of leukaemia can be distinguished by signature patterns of gene expression. Knowing the exact type of the disease is important for prognosis and for selecting optimal treatment.
• Drug selection. Genetic factors can be detected that govern responses to drugs, which in some patients render treatment ineffective and in others cause unusual or serious adverse reactions.
• Determination of gene function. A gene with an expression pattern similar to genes in a metabolic pathway is also likely to participate in the pathway.
• Target selection for drug design. Proteins showing enhanced transcription in particular disease states might be candidates for attempts at pharmacological intervention.
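The fold-change criterion stated under ‘Microarray data are semiquantitative’ can be sketched in a few lines. This is an illustration under stated assumptions: the function names and the intensity values are hypothetical, the threshold of 1.5 is the lower end of the range quoted in the text, and the symmetric treatment of decreases (a ratio of 1/1.5 or less) is an assumption of this sketch.

```python
import math


def log2_ratio(sample: float, control: float) -> float:
    """Log2 of the sample/control intensity ratio (0 means no change)."""
    return math.log2(sample / control)


def fold_change_call(sample: float, control: float, threshold: float = 1.5) -> str:
    """Label a gene 'up', 'down', or 'unchanged' by its fold change.

    A decrease is judged symmetrically: ratio <= 1/threshold (an assumption).
    """
    ratio = sample / control
    if ratio >= threshold:
        return "up"
    if ratio <= 1.0 / threshold:
        return "down"
    return "unchanged"


# Hypothetical (sample, control) intensities for three genes.
measurements = {"gene1": (300.0, 100.0), "gene2": (100.0, 300.0), "gene3": (120.0, 100.0)}
calls = {name: fold_change_call(s, c) for name, (s, c) in measurements.items()}
print(calls)
```

Working with log ratios is a common convenience because a doubling and a halving then have equal magnitude and opposite sign.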
Analysis of microarray data 269
The raw data of a microarray experiment is an image in which the colour and intensity of the fluorescence reflect the extent of hybridization to alternative probes (see Figure 9.1). The two sets of targets are tagged with red and green fluorophores. If only one target hybridizes, the spot appears green; if only the other target hybridizes, the spot appears red. If both hybridize, the colour of the corresponding spot appears yellow.
Extraction of reliable biological information from a microarray experiment is not straightforward. Despite extensive internal controls, there is considerable noise in the experimental technique. In many cases, variability is inherent within the samples themselves. Microorganisms can be cloned; animals can be inbred to a comparable degree of homogeneity. However, experiments using RNA from human sources – for example, a set of patients suffering from a disease and a corresponding set of healthy controls – are at the mercy of the large individual variations that unrelated humans present. Indeed, inbred animals, and even apparently identical eukaryotic tissue-culture samples, show extensive variability.
Data reduction involves many technical details of image processing, checking of internal controls, dealing with missing data, selecting reliable measurements, and putting the results of different arrays on consistent scales. There is extensive redundancy in a microarray – each sequence may be represented by several spots, and in addition to straight duplicates, they may correspond to different regions of a gene. Probe pairs – one perfectly matching oligonucleotide and the other containing a deliberate mismatch – allow data verification. Different oligonucleotides cover different segments of each region of interest in the sequence. Typically, one gene may correspond to ∼30–40 spots.
The initial goal of data processing is a gene expression table. This is a matrix containing relative expression levels, derived from the raw data. The rows of the matrix correspond to different genes and the columns to different sources of material. Of course, the gene expression table is not a simple ‘replica plate’ of the microarray itself. The microarray fluorescence pattern contains the raw data from which the gene expression table must be extracted. Data from many spots on the microarray will contribute to the calculation of the relative expression level of each gene.
A typical experiment compares expression patterns in material from two sources – perhaps a control of known properties and a sample to be tested. We may wish to compare organisms growing under different experimental conditions and/or physiological states, or DNA from different individuals or different tissues, or a series of developmental stages.
Two general approaches to the analysis of a gene expression matrix involve (1) comparisons focused on the genes, i.e. comparing distributions of expression patterns of different genes by comparing rows in the expression matrix; or (2) comparisons focused on samples, i.e. comparing expression profiles of different samples by comparing columns of the expression matrix.
• Comparisons focused on genes: how do gene expression patterns vary among the different samples? Suppose a gene is known to be involved in a disease, or linked to a change in physiological state in response to changed conditions. Other genes co-expressed with the known gene may participate in related processes contributing to the disease or change in state. More generally, if two rows (two genes) of the gene expression matrix show similar expression patterns across the samples, this suggests
a common pattern of regulation and some relationship between their functions, possibly including (but not limited to) a direct physical interaction.
• Comparisons focused on samples: how do samples differ in their gene expression patterns? A consistent set of differences among the samples may distinguish and characterize the classes from which the samples originate. If the samples are from different controlled sources (for instance, diseased and healthy animals), do samples from different groups show consistently different expression patterns? If so, given a novel sample, we could assign it to its proper class on the basis of its observed gene expression pattern.
How can we measure the similarity of different rows or columns? Each row or column of the expression matrix can be considered as a vector in a space of many dimensions. In the row vectors, or gene vectors (a row corresponds to a gene), each position refers to the same gene in different samples. The gene vector has as many elements as there are samples. In the column vectors, or sample vectors (a column corresponds to a sample), each entry refers to a different gene in a single sample. The sample vector has as many elements as there are genes reported. It is possible to calculate the ‘angle’ between different gene vectors, or between different sample vectors, to provide a measure of their similarities. The smaller the angle, the more similar the pattern.
The gene vectors and sample vectors correspond, separately, to points in spaces of many dimensions. The number of dimensions is either the number of genes or the number of samples. We may not be able to visualize easily points in a space with more than three dimensions, but all of our intuition about geometry works fine. For instance, it is natural to ask whether subsets of the points form natural clusters – points with high mutual similarity – characterizing either sets of genes or sets of samples.
We have already encountered clustering in relation to phylogenetic trees. Similarly, in analysis of gene expression arrays, after finding clusters we can bring similar genes and samples together. This amounts to reordering the rows and columns of the gene expression matrix. The results are often displayed as a chart, coloured according to the difference in expression pattern. Figure 9.2 contains an example.
Depending on the origin of the samples, what is already known about them, and what we want to learn, data analysis can proceed in different directions.
1. The simplest case is a carefully controlled study, using two different sets of samples of known characteristics. For instance, the samples might be taken from bacteria grown in the presence or absence of a drug, from juvenile or adult fruit flies, or from healthy humans and patients with a disease. We can focus on the question: what differences in gene expression pattern characterize the two states? Can we design a classification rule such that, given another sample, we can assign it to its proper class? This would be useful in diagnosis of disease. Subject to the availability of adequate data, such an approach can be extended to systems of more than two classes.
   In computer science, training such a classification algorithm is called ‘supervised learning’. The expression pattern of each sample is given by a vector corresponding to a single column of the matrix. This corresponds to a point in a many-dimensional space – as many dimensions as there are genes. In favourable cases, the points may fall in separated regions of space. Then a scientist, or a computer program, will be able to draw a boundary between them. In other cases, separation of classes may be more difficult.
2. In a different experimental situation, we might not be able to pre-assign different samples to different categories. Instead, we hope to extract the classification of samples from the analysis. The goal is to cluster the data to identify classes of samples and then to investigate the differences among the genes that characterize them.
An intrinsic problem – and a severe one – in interpreting gene expression data is the fact that the number of genes is much larger than the number of samples. We are trying to understand the relationship of one space of very many variables (the genes) to another one (the phenotype) from only a few measured points (the samples). The sparsity of the observations does not give us anywhere near adequate coverage. Statistical methods bear a heavy burden in the analysis to give us confidence in the significance of our conclusions.
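The ‘angle’ between gene vectors or between sample vectors, described above, can be computed as in the following minimal sketch. The expression values and gene names are invented, and the helper names are hypothetical. Practical analyses often centre or log-transform the data first; that step is omitted here.

```python
import math


def angle_degrees(u, v):
    """Angle between two expression vectors, from the normalized dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# Made-up expression table: rows are genes, columns are four samples.
expression = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [2.0, 4.0, 8.0, 16.0],  # same pattern as geneA, twice the level
    "geneC": [8.0, 4.0, 2.0, 1.0],   # opposite trend
}


def sample_vector(table, column):
    """Extract one column (one sample) as a vector over all genes."""
    return [row[column] for row in table.values()]


angle_ab = angle_degrees(expression["geneA"], expression["geneB"])
angle_ac = angle_degrees(expression["geneA"], expression["geneC"])
angle_samples = angle_degrees(sample_vector(expression, 0), sample_vector(expression, 3))
print(angle_ab, angle_ac, angle_samples)
```

Genes A and B differ only in overall level, so the angle between them is essentially zero, whereas genes A and C, which trend in opposite directions, are separated by a much larger angle.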
[Figure 9.2: chart of expression data with columns labelled Vehicle and Dexamethasone, a colour scale running from 0.0 to 5.0, and the chemical structure of dexamethasone.]
Figure 9.2 Glucocorticoids are regulators of immune system physiology, widely used as anti-inflammatory drugs, for instance in asthma
and arthritis. However, cataract formation is a common side effect of prolonged use.
Glucocorticoids are well known to affect gene expression patterns. The steroid binds to a glucocorticoid receptor bound on a cell
surface. Like other steroid hormone receptors, the glucocorticoid receptor is a zinc-finger transcription factor. The ligated receptor
dimerizes and translocates to the nucleus where it binds a DNA sequence called the glucocorticoid response element, activating other
transcription factors to modulate expression of target genes.
Gupta and co-workers studied the effect of dexamethasone, a hydrocortisone analogue, on the protein expression pattern in cultured
epithelial cells of the human lens.* Three samples were treated with 1 μM dexamethasone for 4 hours. Three corresponding controls
were given only ‘vehicle’. (In pharmacology, a vehicle is the medium of delivery of the drug.) RNA was extracted, purified, and
converted to cDNA. The microarray used was an Affymetrix chip that contained 22 283 known human transcripts and expressed
sequence tags (ESTs) for about 15 000 genes.
The figure shows data for these six samples. Levels of expression enhanced or reduced by >1.5-fold were considered significant.
Data in red correspond to genes upregulated by dexamethasone; data in green correspond to downregulated genes. The scale on
the right relates colour to expression change. The results identified 93 upregulated transcripts and 43 downregulated ones.
In the figure, the data are clustered based on overall expression level (gene vector, according to rows) and also based on expression
on different chips (sample vector, according to columns). Note that the clusterings by gene and by sample are independent: it would be
possible to change the arrangement of the columns without altering the arrangement of the rows and vice versa.
The trees at the top and at the left indicate the similarities among the results, according to sample vector and gene vector,
respectively. The sample-vector tree at the top cleanly separates the control triple replicates (vehicle) and the dexamethasone-treated
triple replicates. For both control and dexamethasone-treated replicates, the observed expression changes appeared in at least two
of the three measurements. The gene-vector tree at the left has a major breakpoint between downregulated and upregulated genes
(for one gene the behaviour of different replicates is inconsistent).
To discover the implications of these data for cataract induction, the next step would be to examine the biological functions of the
glucocorticoid-sensitive genes. Modern bioinformatics software makes the transition to this computation a facile one. In this case, Gupta
et al. found that the modulated genes involved a wide range of functions. It appears that glucocorticoid elicits a network of responses
that must be pursued downstream to make direct contact with the mechanism of cataract formation.
* Gupta, V., Galante, A., Soteropoulos, P., Guo, S., & Wagner, B.J. (2005). Global gene profiling reveals novel glucocorticoid induced changes in gene
expression of human lens epithelial cells. Mol. Vis. 11, 1018–1040.
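The independence of the clusterings by gene and by sample, noted in the caption of Figure 9.2, can be illustrated with a small sketch. This is not the method used by Gupta et al.: it is a minimal average-linkage clustering on Euclidean distance, written in plain Python, with made-up log ratios and hypothetical function names.

```python
import math


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_order(vectors):
    """Average-linkage agglomerative clustering; returns indices in leaf order."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        pairs = ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]


# Made-up log ratios: rows are genes, columns are samples.
matrix = [
    [1.2, -0.1, 1.0, 0.3],   # gene 0
    [-0.9, 0.0, -1.1, 0.2],  # gene 1
    [1.1, 0.1, 1.3, -0.1],   # gene 2
]

row_order = cluster_order(matrix)                # cluster gene vectors
columns = [list(col) for col in zip(*matrix)]
col_order = cluster_order(columns)               # cluster sample vectors

# Reordering rows and columns are separate operations on the same table.
reordered = [[matrix[r][c] for c in col_order] for r in row_order]
print(row_order, col_order)
```

The row order is computed without reference to the column order and vice versa, which is why either arrangement can be changed without altering the other.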
A fundamental question in biology is how different components of cells smoothly integrate their activities. Measurements of expression patterns tell us part of the story. They provide an inventory of the components but suggest only inferentially how they interact.
Comparisons of alternative physiological states of an organism offer the possibility of extracting, from an entire genome, a subset of genes that underlie a particular life process. An example of shift in physiological state in microorganisms is diauxy. Diauxy, or double growth, is the switch in metabolic state of a microorganism when, having exhausted a preferred nutrient, it ‘retools’ itself for growth on an alternative. The organism may show a biphasic growth curve, with a lag period while the changed complement of proteins is synthesized.
• Jacques Monod discovered diauxy approximately 70 years ago, during his predoctoral work in Paris. He described his observations in his 1941 thesis. Resuming his research career after the war, he and his colleagues at the Institut Pasteur made their fundamental discoveries about the mechanism of gene regulation.
The diauxic shift in yeast is the transition from fermentative to oxidative metabolism upon exhaustion of glucose as an energy source. In this chapter, we shall examine the effects of the reconfiguration of the expression patterns, comparing the different complements of genes active in the two states. In Chapter 11, we shall return to the yeast diauxic shift to examine the control mechanisms that regulate the transcriptional reprogramming.
As an example from higher organisms, we shall look at changes in gene expression patterns in sleep and wakefulness.
The diauxic shift in Saccharomyces cerevisiae
Yeast is capable of adapting its metabolism to a variety of environmental conditions. In the presence of glucose, Saccharomyces cerevisiae will – even in the presence of oxygen – preferentially use the Embden–Meyerhof fermentative pathway, reducing glucose to ethanol. Exhaustion of available glucose produces the ‘diauxic shift’ to oxidative metabolism, sending the products of fermentation through the Krebs (tricarboxylic acid) cycle and mitochondrial oxidative phosphorylation (see Figure 9.3).
The shift is not merely a redirection of metabolic flux through alternative pathways using pre-existing proteins but involves protein synthesis, with a substantial change in expression pattern. Many genes are involved, not only enzymes in the awakened metabolic pathways. Another consequence of switching to respiratory metabolism is the danger of oxidative damage, and the oxidative stress response requires enhanced expression of many other genes.
• Oxygen is essential for aerobic life, yet its reduced forms include some of the most toxic substances with which cells must cope.
The cells are effectively sensing the level of glucose. As glucose itself acts as a repressor of expression of many genes, depletion of glucose releases this repression – ‘turning on’ a variety of genes.
Expression patterns in different physiological states 273
in Figure 9.3).
• Genes related to protein synthesis show a decrease
in expression level. These include genes for ribo-
somal proteins, tRNA synthetases, and initiation
and elongation factors. An exception is that genes
encoding mitochondrial ribosomal molecules have
generally enhanced expression.
• Genes related to a number of biosynthetic path-
ways show reduced expression. These include
genes encoding enzymes of amino acid and nucleo-
tide metabolism.
• Genes involved in defence against oxidative stress
from reactive oxygen species show enhanced
expression. Proteins encoded include catalases,
peroxidases, superoxide dismutases, and glutathi-
one S-transferases.
It is always useful to study a biological phenomenon in the simplest organism that exhibits it, as well as in humans. Comparisons of flies and mammals may reveal fundamental common features disguised within the individual complexities of different species. The results may also illuminate the functions associated with homologous genes. This is not to deny that, in the realm of cognitive phenomena, humans have pushed things farther than other species and present unique features.
Sleep disorders present common and serious medical problems. Sleep deprivation is a major cause of accidents, leading to loss of life and damage to health, property, and productivity. Indeed in rats and flies, prolonged sleep deprivation is itself fatal, after slightly more than 2 weeks.
C. Cirelli and co-workers studied gene expression patterns of rats and fruit flies in three physiological states: spontaneously awake, spontaneously asleep, and sleep deprived. Protocols were similar but not identical in the experiments on the two species.*
Flies were prepared by accustoming them to an alternating regimen: 12 hours with the lights on, awake (8 a.m. to 8 p.m.) and 12 hours in the dark, asleep (8 p.m. to 8 a.m.). A group of flies was then sleep-deprived for 8 hours after the normal end of the waking period, i.e. from 8 p.m. to 4 a.m.
A complication in studying the effect of sleep and wakefulness arises from the circadian rhythms that the expression patterns of many genes are known to obey. The use of the sleep-deprived animals allowed the effects of waking and sleep states to be distinguished from time-of-day effects. To this end, samples of spontaneously asleep and sleep-deprived flies were collected at 4 a.m. Samples from spontaneously awake flies were collected at 4 p.m. (Table 9.2).
Table 9.2 Sleep and wakefulness states of flies
State     Time of sample collection
          4 p.m.                 4 a.m.
Awake     Spontaneously awake    Sleep deprived
Asleep    –                      Spontaneously asleep
* Cirelli, C., LaVaute, T.M., & Tononi, G. (2005). Sleep and wakefulness modulate gene expression in Drosophila. J. Neurochem. 94, 1411–1419; C. Cirelli (2005). A molecular window on sleep: changes in gene expression between sleep and wakefulness. Neuroscientist 11, 63–74.
The logic is that if the expression pattern of a gene differs between the sleep-deprived and spontaneously awake flies, both in a waking state, the controlling factor must be time of day.
A gene was classified as sleep related if its expression was elevated by a factor >1.5 in spontaneously asleep flies relative to both spontaneously awake and sleep-deprived flies. A gene was classified as wakefulness related if its expression was elevated by a factor >1.5 in both spontaneously awake and sleep-deprived flies relative to spontaneously asleep flies. Some genes showed the influence of both physiological state and time of day.
Cirelli and co-workers studied expression patterns of ∼10 000 genes of the fly. Of these, 121 were wakefulness related and 12 were sleep related. The expression of a partly overlapping set of 130 genes was modulated by time of day: 87 were more highly expressed at 4 p.m. and 43 were more highly expressed at 4 a.m.
The overlap of sleep/wakefulness-related genes with those modulated by time of day demonstrated a relationship between homeostatic and circadian regulation. Two-thirds of sleep-related genes and one-fifth of wakefulness-related genes were modulated by time of day. This is consistent with the observation that flies with mutations in circadian genes can show abnormal homeostatic regulation of sleep.
In both flies and rats, the genes preferentially expressed in wakefulness and sleep fell into several different functional categories.
In flies, genes preferentially expressed in waking states included those encoding proteins involved in detoxification, including cytochrome P450s and glutathione S-transferases; genes involved in defence against immune challenge and in lipid, carbohydrate, and protein metabolism; a transcription factor; a nuclear receptor; and the circadian gene cryptochrome. Genes preferentially expressed in sleep included the glial gene anachronism, the gene encoding the catalytic subunit of glutamate–cysteine ligase, and other genes involved in lipid metabolism.
In rat brains, a much larger number of genes than in flies had raised expression levels during wakefulness. (For rats, a less strict criterion of significant change in expression level was applied: a ratio of >1.2 compared with a ratio of >1.5 in flies.) In contrast to flies, which show fewer sleep-related than
Table 9.3 Enhanced expression of genes in the rat in sleep and wakefulness
Learning and memory Synaptic plasticity: acquisition, potentiation Synaptic plasticity: consolidation, depression
Transport – Membrane trafficking and maintenance
Metabolism Energy metabolism Cholesterol biosynthesis
Transcription (positive regulation) Transcription (negative regulation)
Translation (negative regulation) Translation (positive regulation)
Translation Translation
Stress response General stress response and unfolded protein response
Cell signalling Depolarization-sensitive Hyperpolarization-promoting (leakage)
Glutamatergic neurotransmission GABAergic neurotransmission
From: Cirelli, C. (2005). A molecular window on sleep: changes in gene expression between sleep and wakefulness. Neuroscientist 11, 63–74.
wakefulness-related genes, in the rat approximately the same number of genes showed enhanced expression in sleep as in wakefulness, as shown in Table 9.3. The genes with enhanced expression are associated with different biological functions.
In the rat, wakefulness-related genes are associated with memory acquisition, energy metabolism, transcription activation, cellular stress, and excitatory neurotransmission. Sleep-related genes are associated with a potassium channel, translation machinery, long-term memory, and membrane trafficking and maintenance, including synthesis and transport of glia-derived cholesterol. Cholesterol is a major component of myelin and other membranes. (Membrane maintenance suggests an analogue, at the molecular level, of Shakespeare’s metaphor that sleep ‘. . . knits up the ravelled sleeve of care . . .’.)
Similarities between rats and flies in the functional categories of sleep- and wakefulness-associated genes are interesting. Wakefulness-associated genes in both species include those for members of the Egr family of transcription factors, the mammalian NGFI-B nuclear receptor and the fly orthologue, homologous MAP kinase phosphatases, and for proteins involved in cholesterol synthesis.
Another window on to the molecular biology of sleep is the effect of mutation. A fruit fly mutant, Shaker, sleeps approximately one-third as long as wild-type flies. The gene involved encodes a potassium channel that affects neural electrical activity. Some humans can get by regularly on only 3–4 hours’ sleep per 24 hour interval, but it is not known whether this trait is under the control of the homologous gene. However, what does suggest a link to the fly mutant is a rare human disorder, Morvan’s syndrome, the symptoms of which include insomnia. At least one case of Morvan’s syndrome has appeared to be of immune origin, involving an autoantibody against a potassium channel.
• If the physiology of sleep is not well understood, the molecular biology of sleep is even more obscure. Measurements of changes in gene expression during sleep and wakefulness give clues as to what distinguishes the states, at the molecular level.
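The effect of the different significance criteria used for rats (>1.2) and flies (>1.5) can be illustrated with a minimal sketch. It assumes that each gene is summarized by a single wake/sleep expression ratio and that a sleep-related gene is one whose ratio falls below the reciprocal of the threshold. Both conventions, the function name, and the example ratio are illustrative assumptions, not details taken from the studies cited.

```python
def classify(ratio, threshold):
    """Label a gene from its wake/sleep expression ratio.

    Assumption: the ratio is (level when awake) / (level when asleep),
    and the same threshold is applied symmetrically in both directions.
    """
    if ratio <= 0 or threshold <= 1:
        raise ValueError("ratio must be positive and threshold must exceed 1")
    if ratio > threshold:
        return "wakefulness"
    if ratio < 1 / threshold:
        return "sleep"
    return "unchanged"


RAT_THRESHOLD = 1.2  # less strict criterion quoted for rats
FLY_THRESHOLD = 1.5  # stricter criterion quoted for flies

example_ratio = 1.3  # hypothetical measurement
rat_call = classify(example_ratio, RAT_THRESHOLD)
fly_call = classify(example_ratio, FLY_THRESHOLD)
```

With the same hypothetical ratio of 1.3, the gene counts as wakefulness-related under the rat criterion but as unchanged under the fly criterion, which is one reason why the gene counts for the two species are not directly comparable.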
From: Arbeitman, M.N., et al. (2002). Gene expression during the life cycle of Drosophila melanogaster. Science 297, 2270–2275. Copyright AAAS. Reproduced by permission.
often exhibit parallel expression patterns and similar perturbations of these patterns in mutants. For instance, the expression pattern of the eyes absent mutant, which produces an eyeless or at least a reduced eye phenotype, forms a cluster, in the analysis of expression patterns, with 33 genes. Of these, 11 are already known to function in eye differentiation or phototransduction. The other 22 are likely to do so as well; the data provide at least hypotheses, and at best reliable clues, to the function of these genes.
Different life stages make different demands on different genes
Different genes exhibit different temporal patterns of expression. Most of the developmentally modulated genes are expressed in the embryonic stage, as the whole system is getting started. Genes expressed in the early embryo include transcription factors, proteins involved in signalling and signal transduction, cell-adhesion molecules, channel and transport proteins, and biosynthetic enzymes. A third of these are maternally deposited genes; many of these fall off in expression level within 6–7 hours.
• Stathopoulos and Levine* comment that ‘. . . the genesis of a complex organism from a fertilized egg is the most elaborate process known in biology, and thereby depends on “every trick in the book”’.
* Stathopoulos, A. & Levine, M. (2002). Whole-genome expression profiles identify gene batteries in Drosophila. Dev. Cell 3, 464–465.
The genes studied include one large stage-specific class (36.3%) that shows a single major peak of expression. Some of these remain constitutively expressed (at lower levels) subsequently. Others show sharp peaks in expression level.
Another group of genes (40.3%) shows two peaks in expression. Two patterns are common: genes with their first onset of enhanced expression early in embryogenesis generally have their second at pupation, with elevated expression levels continuing into the pupal stage. Genes with their first onset of enhanced expression late in embryogenesis generally have their second at the late pupal stage, with elevated expression levels continuing into the adult stage. The remaining 23.4% of genes show multiple peaks in expression level.
The observation of similarities between expression patterns in embryonic stages and pupal stages, and between larval and adult stages, is interesting. Certain analogies suggest themselves. In both embryonic and pupal stages, body structures are forming and physical activity is minimal. In contrast, larval and adult stages have a more active lifestyle but stasis of anatomical form, although larvae but not adults grow substantially in size. Consistent with the notion of a ‘go back and get it right this time’ aspect to metamorphosis is the occurrence of some dedifferentiation in the pupa.
Ideas of this sort can be examined in light of the nature of genes that show different lifelong expression patterns. For example, maximal expression levels of most metabolic genes occur during larval and adult stages. Another set of genes involved in larval and adult muscle development has a similar two-peak expression pattern. More precise analysis is possible and shows that steps in the regulatory hierarchy for muscle development show peaks at different times, with genes expressed later being downstream in the regulatory hierarchy. Similar time-of-onset sequences appear in both larvae and pupae. The two stages at which body plans are formed – embryo and pupa – re-utilize not only the same materials but also some of the same mechanisms.
• Measurement of expression patterns at different stages reveals which batteries of genes are active in development. Particularly striking, in Drosophila, is the alternation of time-of-onset of the expression patterns of some genes: embryo and pupa show similarities, as do larva and adult.
Flower formation in roses
Roses have long been favourites of gardeners and lovers: for their appearance, for their scent, and, perhaps less commonly, for their purported medicinal qualities – rose hips (fruits) are rich in vitamin C. Breeders have responded by creating tens of thousands of varieties.
• Roses have been known in Europe since antiquity and appear frequently in literature, famously in works by Dante and Shakespeare. Roses symbolized the two armies fighting for control of England at Bosworth in 1485.
Roses have been cultivated for 5000 years. Rose varieties brought to Europe from China at the end of the 18th century introduced the very desirable property of recurrent blooming throughout the season,
Expression pattern changes in development 279
• cell growth;
• cell biogenesis and organization;
• cell rescue;
• signal transduction; and
• unknown functions.
Taking, as a threshold of significance, a twofold difference in expression level, 77 genes (about one-fifth) had higher expression levels in Fragrant Cloud than in Golden Gate, and only three had higher levels in Golden Gate; the rest had similar expression levels in both varieties. Comparing Fragrant Cloud flowers in stage 1 and stage 4, 65 genes had higher expression levels in stage 4 and 14 had lower levels. Common to both sets were 40 genes that were more highly expressed both in Fragrant Cloud relative to Golden Gate and in Fragrant Cloud stage 4 relative to Fragrant Cloud stage 1 of flower development (see Figure 9.7).
What are these 40 genes? Fifteen are involved in metabolism and seven appear to have roles in secondary metabolism, i.e. reactions outside of the core metabolic pathways responsible for normal growth, development, and reproduction. Two very strongly upregulated genes encode proteins similar in sequence to known enzymes: glutamate decarboxylase, and (+)-δ-cadinene synthase. The enzyme (+)-δ-cadinene synthase is involved in the synthesis of sesquiterpenes, one of the classes of floral scent molecule.
Cloning of the (+)-δ-cadinene synthase gene to produce recombinant enzyme permitted direct functional studies. It catalyses the reaction of farnesyl diphosphate to produce the sesquiterpene germacrene D, the major sesquiterpene component of the scent of Fragrant Cloud (see Figure 9.8).
Colour: the elusive blue rose
A walk through many neighbourhoods in temperate climates will reveal the many colours that roses can display, notably red, pink, peach, and yellow. Conspicuous by its absence is blue. It has not been possible to produce a blue rose by classical selective breeding. This is not for lack of trying: in 1840 the horticultural societies of Great Britain and Belgium offered a prize of 500 000 francs for one.
The major pigments of flower petals are anthocyanins: cyanidin, pelargonidin, and delphinidin. All are synthesized from a common precursor, dihydrokaempferol (see Figure 9.9). It is delphinidin that is blue, but roses lack the gene for the enzyme that produces dihydromyricetin from dihydrokaempferol (marked by a green * in the figure).
Scientists at the Australian Commonwealth Scientific and Industrial Research Organisation (CSIRO), in collaboration with a Melbourne company, Florigene, and the Suntory Corporation of Japan, have created a blue rose by genetic engineering. There were two challenges: to produce the blue pigment and to turn off synthesis of red and orange pigments.
To suppress the red and orange pigments, the target was the enzyme dihydroflavonol reductase (DFR). DFR modifies precursor molecules in all three branches of the pathway; a DFR− plant would have white flowers, providing a ‘clean slate’ for production and display of blue pigments. However, DFR is also needed for synthesis of delphinidin.
It was necessary to manipulate the pathways with precision. An interfering RNA (RNAi) was engineered
[Figure 9.8 reaction scheme: Farnesyl diphosphate (OPP) to Germacrene D]
Figure 9.8 Conversion of precursor farnesyl diphosphate to germacrene D, a common scent molecule produced by rose petals.
[Figure 9.7 Venn diagram. Fragrant Cloud stage 4 > stage 1 only: 25; Common: 40; Fragrant Cloud > Golden Gate only: 37]
Figure 9.7 Two rose cultivars, Fragrant Cloud and Golden Gate, differ in their maximal odour production in stage 4 of flower development: Fragrant Cloud is rich in scent and Golden Gate is poor. Scent in Fragrant Cloud is fully developed in stage 4 relative to the immature flowers in stage 1.
In comparisons of expression patterns, 37 genes were expressed more highly only in stage 4 Fragrant Cloud rose flowers relative to stage 4 Golden Gate rose flowers; 25 genes were expressed more highly only in stage 4 Fragrant Cloud rose flowers relative to stage 1 Fragrant Cloud rose flowers; and 40 were expressed more highly in stage 4 Fragrant Cloud flowers than in both stage 4 Golden Gate flowers and stage 1 Fragrant Cloud flowers.
These experiments can help to identify proteins preferentially expressed in stage 4 Fragrant Cloud flowers that are involved in scent biosynthesis.
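The three numbers in the Venn diagram of Figure 9.7 follow from the gene counts quoted in the text: 77 genes higher in Fragrant Cloud than in Golden Gate, 65 genes higher in stage 4 than in stage 1, and 40 genes common to both sets. A minimal sketch of that arithmetic is given below; the function and key names are illustrative assumptions.

```python
def venn_regions(n_first, n_second, n_both):
    """Split two overlapping gene sets into the three regions of a Venn diagram."""
    if n_both > min(n_first, n_second):
        raise ValueError("overlap cannot exceed either set")
    return {
        "first_only": n_first - n_both,
        "both": n_both,
        "second_only": n_second - n_both,
    }


# first set: Fragrant Cloud > Golden Gate (77 genes)
# second set: Fragrant Cloud stage 4 > stage 1 (65 genes)
regions = venn_regions(n_first=77, n_second=65, n_both=40)
# regions == {"first_only": 37, "both": 40, "second_only": 25}
total_upregulated = sum(regions.values())  # 102 distinct genes in the union
```

So 37 genes are distinguished only by the cultivar comparison, 25 only by the developmental-stage comparison, and 40 by both.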
[Figure 9.9 structures, left to right: Dihydroquercetin, Dihydrokaempferol, * Dihydromyricetin]
Figure 9.9 Simplified scheme showing structures and metabolic relationships of anthocyanin flower pigments. The difficulty in breeding
a blue rose is that roses lack the gene that encodes the enzyme flavonoid 3′,5′-hydroxylase (F3′,5′H), which produces the precursor of
the blue pigment delphinidin. In other plants, this enzyme acts at the point marked by a green *. The steps corresponding to the vertical
arrows are all catalysed by a single enzyme, dihydroflavonol reductase (DFR).
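The logic of the scheme in Figure 9.9 can be sketched as a small lookup. The sketch assumes the standard correspondence between the three dihydroflavonols and the three pigments (dihydrokaempferol to pelargonidin, dihydroquercetin to cyanidin, dihydromyricetin to delphinidin) and uses the name F3'H (flavonoid 3′-hydroxylase) for the enzyme that produces dihydroquercetin. Neither detail is spelled out in the caption, and the steps that follow DFR in a real plant are ignored.

```python
# Hydroxylation steps that branch off from the common precursor.
HYDROXYLATION = {
    "F3'H": ("dihydrokaempferol", "dihydroquercetin"),     # assumed enzyme name
    "F3'5'H": ("dihydrokaempferol", "dihydromyricetin"),   # the step roses lack
}

# Pigment reached from each precursor once DFR has acted (assumed mapping).
PIGMENT_FROM = {
    "dihydrokaempferol": "pelargonidin",
    "dihydroquercetin": "cyanidin",
    "dihydromyricetin": "delphinidin",
}


def pigments(enzymes):
    """Return the set of pigments a plant with the given enzymes could make."""
    precursors = {"dihydrokaempferol"}
    for enzyme, (substrate, product) in HYDROXYLATION.items():
        if enzyme in enzymes and substrate in precursors:
            precursors.add(product)
    if "DFR" not in enzymes:
        return set()  # no DFR: no pigment in any branch (white flowers)
    return {PIGMENT_FROM[p] for p in precursors}


wild_type_rose = pigments({"F3'H", "DFR"})
rose_with_hydroxylase = pigments({"F3'H", "F3'5'H", "DFR"})
rose_without_dfr = pigments({"F3'H", "F3'5'H"})
```

In this toy model a wild-type rose cannot reach delphinidin, adding F3′,5′H opens the blue branch without closing the others, and removing DFR blocks every branch at once, which is why suppressing the red and orange pigments while keeping the blue one requires selective manipulation.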
Learning and memory involve changes in the structure and biochemistry of nerve cells. Nervous systems are dynamic networks, passing signals among cells. Synapses are the sensitive points at which neurons interact. As each neuron integrates inputs from several others, its output depends on the ‘weighting’ of its inputs – the distribution of the strengths of the input synapses. Increasing or reducing the strengths of individual synaptic connections modulates the dynamics of the network.
Learning and memory must involve some permanent structural change. The observation that memory survives periods of coma, during which neural activity ceases, proves this. The change in the network cannot, therefore, be purely a change in the dynamic state.
Long-term potentiation (LTP) is a neural phenomenon underlying learning and memory. LTP is a persistent increase in strength of a synaptic connection as a result of stimulation of the upstream cell. The original observation, first described by Bliss and Lømo in 1973, was that high-frequency stimulation of a synapse during a finite time interval produced a persistent subsequent enhancement of the postsynaptic response. It is believed that transient effects, lasting <1 hour, require modifications of pre-existing proteins at the synapse. Longer-lasting effects involve protein synthesis and gene transcription, and resulting structural remodelling of synapses.
• Learning must be regarded as a specialized form of development.
In addition to neural plasticity, learning also involves generation of new neurons. LTP stimulates both neurogenesis and enhanced survival of new cells.
Park and co-workers studied genes that change their expression pattern in response to LTP.* They studied cells from the mouse dentate gyrus, a structure within the hippocampus. The hippocampus is known to be involved in memory formation, especially spatial memory (see Box 9.2). Following high-frequency stimulation for four separated 1-second intervals, total RNA was extracted after several time intervals – 30, 60, 90, and 120 minutes after stimulation – and then reverse transcribed into cDNA and hybridized onto Affymetrix GeneChip arrays. The chip reported 12 000 genes and ESTs. Samples of unstimulated tissue provided controls.
* Park, C.S., Gong, R., Stuart, J., & Tang, S.-J. (2006). Molecular network and chromosomal clustering of genes involved in synaptic plasticity in the hippocampus. J. Biol. Chem. 281, 30 195–30 211.
Park and colleagues found, across all time points, 1664 genes with statistically significant changes in expression pattern. Of these, 39% were upregulated and 61% downregulated. The genes identified suggest that LTP produces changes in a variety of processes affecting cell morphology and affects interactions among cells and between cells and the extracellular matrix.
Specific functional assignment showed several categories of genes, shown in Table 9.5 (this table shows a composite containing genes identified as having changed expression at any of the time points; see also Figure 9.11). Most but not all of the categories contain examples of both up- and downregulated genes.
BOX 9.2 London taxi drivers: spatial memory and the hippocampus
London taxi drivers must have an exhaustive knowledge of the metropolitan geography, of optimal routes between points both famous and obscure, and of variations in traffic patterns. As portrayed in the famous 1979 film The Knowledge, drivers must pass a strict test to earn their licence before taking control of a classic black London cab. Consistent with the involvement of the hippocampus in spatial memory, brain scans show that London taxi drivers have a larger hippocampus than a control group, and that the hippocampus enlarges with time spent behind the wheel.*
* Maguire, E.A., Gadian, D.G., Johnsrude, I.S., et al. (2000). Navigation-related structural change in the hippocampi of taxi drivers. Proc. Natl. Acad. Sci. USA 97, 4398–4403; Maguire, E.A., Spiers, H.J., Good, C.D., Hartley, T., Frackowiak, R.S., & Burgess, N. (2003). Navigation expertise and the human hippocampus: a structural brain imaging analysis. Hippocampus 13, 250–259.
Expression patterns in learning and memory: long-term potentiation 283
[Figure 9.11 heat map. Columns: Control, 30 min, 60 min, 90 min, 120 min. Labelled genes (gene name: gene function): Galactosylceramidase: myelin metabolism; Cerebellin 1 precursor protein: neuropeptide; 42 kD cGMP-dependent protein kinase anchoring protein: synaptic plasticity; Striatin: Ca2+ signalling in spines]
Figure 9.11 Clustering of genes differentially expressed after induction of LTP. The scale at the bottom indicates the expression level
relative to the control: green = enhanced expression; red = reduced expression. Brackets indicate clusters of genes with known neural or
synaptic functions.
From Park, C.S., et al. (2006). Molecular network and chromosomal clustering of genes involved in synaptic plasticity in the hippocampus. J. Biol.
Chem. 281, 30 195–30 211.
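The idea of grouping genes by their temporal expression profile after LTP induction can be illustrated with a deliberately crude sketch. It is not the clustering method behind Figure 9.11: it simply reduces each gene to a pattern of up (+), down (-), or unchanged (0) at the four time points and groups genes that share a pattern. The gene names, the log2 ratios, and the twofold cutoff are all hypothetical.

```python
from collections import defaultdict

TIME_POINTS = (30, 60, 90, 120)  # minutes after LTP induction


def signature(log2_ratios, cutoff=1.0):
    """Encode a time course as a string of '+', '-' and '0' symbols.

    Assumption: values are log2(stimulated / control), so a cutoff of 1.0
    corresponds to a twofold change in either direction.
    """
    return "".join(
        "+" if x >= cutoff else "-" if x <= -cutoff else "0"
        for x in log2_ratios
    )


def group_by_signature(profiles, cutoff=1.0):
    """Map each direction-of-change pattern to the genes that show it."""
    groups = defaultdict(list)
    for gene, ratios in profiles.items():
        groups[signature(ratios, cutoff)].append(gene)
    return dict(groups)


# Hypothetical profiles, one value per entry in TIME_POINTS.
profiles = {
    "gene_A": (-1.4, 1.2, 1.5, 1.1),  # down at 30 min, up afterwards
    "gene_B": (-1.1, 1.6, 1.3, 1.2),  # same pattern as gene_A
    "gene_C": (1.3, 1.4, 0.2, 0.1),   # up at 30 and 60 min only
}
groups = group_by_signature(profiles)
```

Here gene_A and gene_B fall into one group (down at 30 minutes, up at 60, 90, and 120 minutes), while gene_C forms a separate group that is upregulated only at the early time points.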
Table 9.5 Genes that change expression pattern in response to long-term potentiation (LTP)
Functional category | Upregulated | Downregulated
The extracellular matrix and its regulation | + | +
Membrane protein/cell surface/adhesion molecule | + | +
Neurosteroid hormone metabolism | − | +
Cytokine/growth factor/receptor | + | +
Other receptors/signalling | + | +
Ion channel | + | +
Transcription factor/regulation | + | +
Translation | + | +
Neurotransmitter receptor/neuromodulator | + | +
Regulation of cytoskeleton | + | +
Mitochondrial/energy production | + | +
Proteases/protease inhibitors | + | +
Immunoresponsive proteins/oxidative stress/neuroprotection/cell death | + | +
Myelin-related proteins | − | +
Chromatin structure | + | −
Expression patterns can identify particular genes involved in neural plasticity. Many of the genes identified by changed expression patterns were already known to play roles in synaptogenesis, synapse differentiation, neurite outgrowth, and synaptic plasticity. Others were not previously known to be involved in LTP, but their altered expression pattern makes them candidates to be tested for their effects on LTP.
For example, transglutaminase is known to be expressed in neural tissue and appears in synapses. However, it was not known to be implicated in LTP. A connection was confirmed by showing that cystamine – a specific antagonist of transglutaminase – impairs LTP. (Cystamine is an inhibitor of transglutaminase and also causes disulphide exchange producing unfolding. Transglutaminase also helps to produce protein aggregates in Huntington’s disease. Cystamine does ameliorate Huntington’s disease in the mouse, but the mechanism is unclear.)
One puzzling gene is CDC25B, an oncogene encoding a tyrosine phosphatase, which functions as a cell-cycle regulator. Use of a specific inhibitor of CDC25B protein product blocked LTP. CDC25B must, therefore, have an essential role, but the mechanism remains obscure. This is precisely the kind of unexpected connection that high-throughput methods can turn up.
Some of the differentially expressed genes are coordinated into coherent pathways. These include enhanced expression of genes in the MAPK signalling cascade (which was already known to be important in LTP) and the Wnt signalling pathway (which had not previously been connected with LTP).
Comparison of expression patterns at different times after LTP induction revealed the temporal expression patterns of different genes. An interesting observation is that many genes in the same general functional groups have similar temporal expression profiles. Conversely, different time points are associated with a ‘schedule’ of activity of genes with particular types of function.
• Genes with common time profiles. Genes involved in responses to external stimuli are upregulated at 30 and 60 minutes after LTP induction. These may be involved in interactions between pre- and postsynaptic components. Genes involved in signal transduction and transcription regulation provide less-clear time profiles. It is likely that their effects are indirect.
• Events happening at particular times. Many genes active at 30 minutes are involved in cell–cell interaction, synapse formation and remodelling, and neurite outgrowth. These represent a relatively early component of the response. Genes related to the cytoskeleton are downregulated at 30 minutes, but upregulated at 60, 90, and 120 minutes. These may be involved in structural changes at the synapse.
Conserved clusters of co-expressing genes
Mapping the loci of the differentially expressed genes showed that they are concentrated in specific chromosomal regions (tandem duplicates were removed). These clusters tended preferentially to contain genes with similar functions. Comparison of the distributions of homologues in the genomes of rats, humans, Drosophila, and C. elegans suggests that the clustering is conserved in evolution. The clustering may contribute to a mechanism for common regulation of expression, reminiscent of a bacterial operon.
Evolutionary changes in expression patterns
The very high similarity in genome sequence between humans and chimpanzees suggests that the evolutionary differences between such closely related species would not lie primarily in the relatively small changes in sequences of individual proteins, but in expression patterns. Microarrays permit a test of this idea. Nevertheless, amino acid sequence changes in proteins may be significant, even though they may be small. One example is the FOXP2 gene, discussed in Chapter 4. Another is a contributor to the control of overall cerebral cortical size (a crude but not entirely irrelevant feature of our mental evolution), the gene ASPM (abnormal spindle-like microcephaly-associated), which has undergone positive selection in the lineage leading to humans.
• The idea of the importance of changes in expression pattern to evolution appeared in a seminal paper by M.C. King and A.C. Wilson in 1975.
The design of the experiments presents a number of difficulties, however.
• There is a high background of variation in expression pattern among different individuals of any species and among different tissues within any individual. This makes it difficult to identify changes unambiguously attributable to species differences.
• Use of a microarray containing oligomer sequences derived from human genes to measure mRNA levels in chimpanzee tissue underestimates the expression levels in the chimpanzee because of less-effective hybridization resulting from sequence changes.
Nevertheless, carefully controlled experiments show that there are significant differences between human and chimpanzee expression patterns that arose during evolution.
• The set of genes that show different expression patterns between humans and chimpanzees is rich in transcription factors. This is in accord with King and Wilson’s hypothesis.
• The differences in expression pattern are not uniform in different tissues. Our hearts and livers have diverged from those of chimpanzees in expression pattern more than our brains have, both in terms of the numbers of differentially expressed genes and the amount of the differences in transcription levels. (Would you have guessed this?) However, looking at the course of evolution using the macaque, for instance, as an outgroup, shows that the human brain is particularly rich in genes with increased expression levels relative to the chimpanzee, consistent with the distinct differences in cognitive abilities. In other tissues, there is a more even distribution of genes expressed more highly in humans or more highly in the chimpanzee.
• Changes in expression patterns tend to be lower in X and higher in Y chromosomes than in autosomes. For brain tissue, the average human/chimpanzee ratio of expression level is about 1.51 for autosomes, 1.43 for the X chromosome, and 2.14 for the Y chromosome.
• Duplicated genes tend to show a higher divergence in expression pattern than non-duplicated genes. One possible consequence of gene duplication is divergence and specialization of function. This is consistent with the requirement for differential control of gene expression. (Recalling Chapter 1, vertebrate α- and β-globins are exceptions to this paradigm: it is necessary to calibrate their levels of expression. It is not yet clear what mechanism achieves this. Synthesis of different amounts of α- and β-globin causes thalassaemias.)
• The differences in expression patterns are not uniform, even across different autosomes. During the 6–7 million year period of divergence, there has been substantial chromosome rearrangement between humans and chimpanzees (see Figure 3.6).
[Figure 9.12 plot. Vertical axis: Gene expression ratio (scale from about 1.3 to 2.2). Horizontal axis: Chromosome (1–22, X, Y)]
Figure 9.12 Average ratio of gene expression levels between human and chimp cortical tissue in collinear (red) and rearranged (blue) chromosomes. X and Y chromosomes shown in green.
After: Marquès-Bonet, T., Cáceres, M., Bertranpetit, J., Preuss, T.M., Thomas, J.W., & Navarro, A. (2004). Chromosomal rearrangements and the genomic distribution of gene-expression divergence in humans and chimpanzees. Trends Genet. 20, 524–529.
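Averages of the kind plotted in Figure 9.12, or quoted in the text for autosomes, X, and Y, are obtained by grouping per-gene expression ratios by chromosome and taking the mean of each group. The sketch below shows the bookkeeping only. The per-gene ratios are invented for illustration, and the way the published ratios were normalized is not described here, so the input convention is an assumption.

```python
from statistics import mean


def chromosome_class(chromosome):
    """Collapse chromosome names into 'autosome', 'X' or 'Y'."""
    return chromosome if chromosome in ("X", "Y") else "autosome"


def mean_ratio_by_class(genes):
    """Average human/chimpanzee expression ratio for each chromosome class.

    `genes` is an iterable of (chromosome, ratio) pairs, one per gene.
    """
    by_class = {}
    for chromosome, ratio in genes:
        by_class.setdefault(chromosome_class(chromosome), []).append(ratio)
    return {cls: mean(ratios) for cls, ratios in by_class.items()}


# Hypothetical per-gene ratios (not data from the study).
genes = [
    ("1", 1.4), ("2", 1.6),
    ("X", 1.4), ("X", 1.5),
    ("Y", 2.0), ("Y", 2.3),
]
averages = mean_ratio_by_class(genes)
```

Grouping by individual chromosome instead of by class, and colouring the result by whether the chromosome is collinear or rearranged, would give a table with the same layout as the figure.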
Table 9.6 Areas of brain for which human/chimpanzee expression pattern differences were measured
In cortical tissue, changes in expression patterns are larger among genes in rearranged chromosomes than syntenic ones (see Figure 9.12).
The comparative analysis of expression patterns in human and chimpanzee brains has been pursued to high resolution. Khaitovich and colleagues used an array containing ∼10 000 human genes and arrays containing ∼40 000 human transcripts to test several areas of human and chimpanzee brains* (see Table 9.6).
* Khaitovich, P., et al. (2004). Regional patterns of gene expression in human and chimpanzee brains. Genome Res. 14, 1462–1473.
Differences between the species must be extracted from the variation among individuals. Within each species, typically a few hundred genes (out of ∼10 000 tested) vary in expression in different brain regions. There is relatively low variation within the
Applications of microarrays in medicine 287
cortex itself, but over 1000 genes show differences in expression pattern between the cerebellum and other regions.
It is surprising to observe a similarity of expression pattern in humans between Broca’s area, associated with speech, and the homologous right-hemisphere area, which is not. This observation suggests that the achievement of language in humans did not depend on localized changes in transcription patterns.
Functional analysis showed that genes encoding proteins involved in signal transduction, cell–cell communication, differentiation, and development show a greater-than-random tendency to vary in expression between regions, within both species. Genes encoding proteins involved in protein synthesis and turnover tend to show conserved expression patterns.
Approximately 10% of the genes studied differ in expression pattern between humans and chimpanzees. Most of the differences appear in two or more regions of the brain. The cerebellum contains several genes showing species-specific differences in expression not shared by other regions. An analysis of functional categories of the genes showing enhanced or reduced interspecies expression differences does not reveal an enrichment in specific families.
If our goal is to understand at the molecular level the phenotypic differences between humans and chimpanzees involving higher mental functions such as cognition and language, our data must be very accurate and detailed, for as the traits grow more subtle, the molecular signal grows correspondingly fainter. At some point, it will be necessary to trace the origin of the changes in expression patterns to protein and genomic sequences. This does not contradict the King and Wilson hypothesis; after all, amino acid sequence changes modulate the functions of regulatory proteins. We may also be required to address different levels of complexity: the different traits may depend on very complicated patterns of interactions, difficult to infer from properties of individual genes.
[Figure 9.13 chemical structures. Left: Vancomycin. Right: α-Avoparcin (R = H) and β-Avoparcin (R = Cl)]
Figure 9.13 (left) Vancomycin, a glycopeptide antibiotic produced by the bacterium A. orientalis. (right) Antibiotics α- and β-avoparcin, related to vancomycin and produced by Streptomyces candidus.
From: Lu, K., Asano, R., & Davies, J. (2004). Antimicrobial resistance gene delivery in animal feeds. Emerg. Infect. Dis. 10, 679–683.
Table 9.8 Genes in vancomycin resistance cluster 35 genes consistently showed increased expression,
some as high as 30-fold, and 16 consistently showed
Gene Action of gene product
decreased expression.
VanH Reduces pyruvate to D-lactate Genes upregulated with increased vancomycin
VanA Esterifies D-ala–D-lactate resistance are associated with the following:
VanX Hydrolyses D-ala–D-ala, leaving D-ala–D-lactate
• purine biosynthesis, which is a large component of
to build the cell wall
the change in expression: 15 of the 35 upregulated
VanS A kinase that senses vancomycin and initiates
genes involved purine biosynthesis or transport,
transcription of the other genes. In the absence
of vancomycin, they are not expressed. and there was a mutation in the regulator of the
purine biosynthesis operon;
number of structural changes, including reduced • cell envelope synthesis, remodelling, and
growth rate, reduction in cell wall cross-links, degradation;
increased cell-wall thickness, and the appearance • proteins involved in transport and binding of
of d-glutamic acid instead of d-glutamine in the amino acids, peptides and amines, and nucleic acid
peptide. The genomic changes responsible for the components (including purines);
VISA phenotype can be magnified by vancomycin • synthesis of staphyloxanthin, an orange carotenoid
challenge and selection to produce VRSA strains that gives S. aureus its golden colour;
with MIC = 32 mg/ml. • folic acid synthesis; and
The alternative, for the bacterium, is a counter-
• unknown functions.
attack on vancomycin, to ‘pull its sting’. S. aureus
has achieved high vancomycin resistance by picking up a specific plasmid from a resistant
Enterococcus. The plasmid contains a cluster of genes, leading to changing the D-ala–D-ala at the
C terminus of the cross-linking pentapeptide to D-ala–D-lactate (Table 9.8). The modified peptide
can enter the cell wall but has a lower binding affinity for vancomycin by a factor of ∼1000.

• Microorganisms also develop resistance by evolving enzymes that destroy an antibiotic or pump it
out of cells. S. aureus followed this route to gain resistance to penicillin, which initially led
clinicians to turn to vancomycin.

Mongodin and co-workers compared expression patterns of genes in VISA strains (MIC ∼8 µg/ml) with
VRSA strains produced by selection, not containing the resistance plasmid.* The array contained
2688 oligonucleotides. The experiments were run in parallel, starting with two different clinical
VISA isolates. Upon increased vancomycin resistance,

Genes downregulated with increased vancomycin resistance are associated with:
• energy metabolism;
• cell envelope biosynthesis;
• proteins involved in transport and binding of carbohydrates, organic alcohols, and acids;
• salvage of nucleic acid components;
• regulatory functions; and
• tetracycline resistance.

It is not always easy to put together the details of a change in expression pattern involving many
metabolic subsystems in order to grasp the salient message (see Figure 9.14). However, it is
reasonable to think that the goal of the changes is to defend the cell wall, as that is the target
of the antibiotic. As Mongodin and colleagues suggested, many of the changes in expression levels
combine to funnel metabolites to the formation of ATP. These changes include downregulation of the
genes that encode proteins for conversion of ATP to the corresponding deoxynucleoside triphosphate
for DNA synthesis (nrdD) and for the degradation of AMP (deoD). Key enzymes in glycolysis and
fermentation are downregulated, diverting glucose 6-phosphate through the pentose phosphate
pathway to form the ribose component of ATP.

* Mongodin, E., Finan, J., Climo, M.W., Rosato, A., Gill, S., & Archer, G.L. (2003). Microarray
transcription analysis of clinical Staphylococcus aureus isolates resistant to vancomycin.
J. Bacteriol. 185, 4638–4643.
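The idea of calling a gene upregulated or downregulated between two strains can be illustrated with a minimal sketch. It assumes that each gene has one intensity reading per condition and that a two-fold change (a log2 ratio of ±1) is the cutoff. Both the cutoff and the intensity values are hypothetical choices for illustration; they are not taken from the study cited above.

```python
import math

def classify_expression_change(before, after, threshold=1.0):
    """Label a gene by the log2 ratio of its expression after vs. before.

    `threshold` is an assumed cutoff in log2 units (1.0 = two-fold change).
    """
    log2_ratio = math.log2(after / before)
    if log2_ratio >= threshold:
        return "upregulated"
    if log2_ratio <= -threshold:
        return "downregulated"
    return "unchanged"

# Hypothetical intensities in arbitrary units: (less resistant, more resistant).
signals = {
    "gene_A": (120.0, 510.0),
    "gene_B": (800.0, 190.0),
    "gene_C": (300.0, 330.0),
}

for gene, (before, after) in signals.items():
    print(gene, classify_expression_change(before, after))
```

In practice the threshold would be chosen together with a statistical test over replicate measurements; the fixed cutoff here only shows the bookkeeping.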
290 9 Microarrays and Transcriptomics
[Figure 9.14 diagram. Labels: Glucose; Glucose-6P; Pentose phosphate pathway; Ribose-5P; Glutamate;
Glycolysis; Lactate; Purine biosynthesis; Xanthine, uracil; AMP; Degradation]
Figure 9.14 With enhancement of vancomycin resistance in the laboratory by selection after vancomycin challenge, expression patterns
of genes associated with some processes are upregulated (blue arrows) and others are downregulated (red arrows). A major target of
upregulation is purine synthesis, aimed at enhanced production of ATP for energy requirements.
After: Mongodin, E., Finan, J., Climo, M.W., Rosato, A., Gill, S., & Archer, G.L. (2003). Microarray transcription analysis of clinical Staphylococcus
aureus isolates resistant to vancomycin. J. Bacteriol. 185, 4638–4643.
Synthesis of the thickened cell wall is a very energy-intensive process. The ratio of cell-wall
volume to total cell volume increased by 41% in the vancomycin-resistant cells. Perhaps reduced
cellular growth rate is a price that must be paid if a larger fraction of the cell’s energy budget
goes into cell-wall synthesis.

• Drug resistance in pathogens is a crucial problem in contemporary medicine. Learning, in detail,
how it comes about, will be essential for developing ways to prevent it.

Childhood leukaemias
Haemopoietic stem cells are the undifferentiated precursors of all types of blood cell. They mature
by differentiating along one of two pathways (see Figure 9.15). B and T cells of the immune system
have followed the lymphoid path; red blood cells have followed the myeloid path. An abnormal genetic
transformation leading to unregulated proliferation of any blood cell, at any stage of
differentiation, gives rise to leukaemia. Leukaemias can be classified according to the type of cell
that is proliferating.
Acute lymphoblastic leukaemia is the most common cancer of children, representing almost one-third
of childhood cancers. The main treatment is chemotherapy, the success rate for which has greatly
improved in the past quarter of a century. Nevertheless, conventional therapy is unsuccessful in
about 25% of patients.
Measurements of gene expression patterns have permitted molecular classification of disease
subtypes, and correlations with response to chemotherapy and likelihoods of rapid or delayed
recurrence or long-term survival.
• From the expression pattern of a group of 50 genes, it is possible to distinguish almost perfectly
between lymphoblastic and myeloid leukaemias and, for lymphoblastic leukaemias, to distinguish
B- and T-cell lineages. The results are calibrated against established methods based on flow
cytometry. The variability in expression pattern is unusually high in acute lymphoblastic leukaemia
relative to other types of cancer. It is thereby feasible to create a molecular taxonomy of
childhood leukaemias.
• Expression patterns can predict the likelihood of a favourable outcome. This combines questions of
the success of therapy, the likelihood of spontaneous relapse after remission, and the development
of secondary tumours. A complicating factor of studies of this type in humans is the fact that the
samples are taken from patients under a variety of treatments.
Applications of microarrays in medicine 291
[Figure 9.15 diagram. Surviving labels: Granulocytic stem cells; Monocytic stem cells]
Figure 9.15 Haematopoiesis is the formation of new blood cells. Our blood contains many types of cell:
These cells arise by different developmental pathways from a common stem-cell precursor, which can potentially differentiate into any
of the mature cell types, a property called totipotency. As maturation proceeds, cells first become pluripotent (able to mature into some
but not all cell types) and then finally committed to a single ultimate form.
Normally, haematopoiesis produces approximately 175 billion red cells, 70 billion granulocytes (neutrophils, eosinophils, basophils),
and 175 billion platelets every day. [One billion is 10⁹.] When we are challenged by infection, production can be stepped up by an order
of magnitude.
Leukaemia is the uncontrolled proliferation of any of these types of blood cell, either mature forms or their precursors. In mammals,
mature erythrocytes, being enucleate, cannot themselves proliferate. However, mutations can occur in precursor cells. For example, a
mutation affecting the JAK2 signalling pathway in the stem-cell precursor can result in overproduction of erythrocytes, a disease called
primary polycythaemia vera. (Secondary polycythaemia vera, also characterized by overproduction of red cells, is a response to lack of
oxygen; possible causes include heavy smoking, emphysema, or moving without acclimation to a high altitude.)
Clinical experience shows that the time interval of first remission is a good predictor of
long-term survival. A group of genes was found with an expression pattern correlated with length of
remission, i.e. these genes are differentially expressed in patients with early and late relapse.
Testing the expression levels of these genes can improve the precision of prognosis. Identifying
the pathways in which the genes are involved can illuminate the underlying biology of the disease
progression. Genes involved in cell proliferation and DNA repair were upregulated in the
early-relapse group.
Development of a second cancer, not related to leukaemia in any obvious way, is a common and very
serious complication of acute lymphoblastic leukaemia. Brain tumours are one of the most common
secondary malignancies. Several genes have been identified, the expression patterns of which
correlate with the risk of secondary brain tumours.
• Expression pattern can predict effectiveness of treatment and guide the choice of therapy. In a
study using 14 500 probe sets and samples from 173 patients, sets of 20–40 genes were identified
that distinguish resistance and sensitivity to four different drugs: prednisolone, vincristine,
asparaginase, and daunorubicin. These results, even taken as purely empirical correlations, have
clinical utility in guiding treatment. Their interpretation at the genetic level reveals that the
activities of the drugs involve some different as well as some common pathways. A set of 45 genes
was found to be correlated with resistance to all four of the drugs. The majority of these genes
involve transcription, DNA repair, cell-cycle maintenance, and nucleic acid metabolism.
• Identification of specific genes involved in diseases can suggest targets for drug development.
For instance, one type of acute lymphoblastic leukaemia is associated with overexpression of the
gene FLT3 for a receptor tyrosine kinase. Patients with mutations that produce constitutively
active receptors have a poor prognosis. FLT3 inhibitors are now in clinical trials.
Thus, expression profiles can permit tailoring of drug therapy, both to the specific disease and
subtype, and to the patient.

• Expression profiling can (a) permit precise diagnosis of the subtype of the disease, (b) predict
the likely course of the disease, and (c) guide choice of therapy.
Gene expression profiling has been the most common application of microarrays. The goal is to
measure the identities and relative amounts of different RNA transcripts in populations of cells.
Comparison of transcript profiles between healthy and disease states, or under different external
conditions, or as a function of time, reveals the changes in gene expression patterns. Examples of
all of these appear in this chapter.
RNA-seq – or whole transcriptome shotgun sequencing – is an alternative method of measuring RNA
levels, applying high-throughput sequencing techniques. RNA isolated from cells is fragmented,
reverse-transcribed to cDNA, and sequenced. Assembly is easiest by aligning with a reference genome.
This will also automatically pick up post-transcriptional edits.
The RNA-seq approach has some advantages over microarrays:
• To construct a microarray, one must make a choice of what probe sequences to include. RNA-seq will
report whatever is there, with no prior commitment to any set of possible sequences.
• In principle, sequencing methods give more precise measurements of RNA concentrations, by
recording how frequently each sequence appears in the pooled results. Not only is sequencing
potentially more precise, it has a higher dynamic range.
• If a population of cells includes both a host and a pathogen, the transcriptomes of both are
simultaneously measurable. In contrast, without a specially designed chip, a microarray would most
likely contain probes only from the host.
Nevertheless, despite many claims, reports of the death of microarrays are greatly exaggerated.
Technically, although the sequencing platforms accurately report the relative amounts of different
cDNAs presented to them, there is potential bias in the yields of reverse transcription from
different RNA molecules. For instance, internal secondary structure of the RNA may interfere with
primer binding. (This may be a problem with microarray experiments also.) In addition, it is
currently true that microarray measurements are less expensive than the RNA-seq approach to
collecting equivalent data. Of course, the cost of sequencing is changing rapidly.

• Methods for expression profiling include microarrays and whole transcriptome shotgun sequencing,
or RNA-seq. Both methods are in widespread use at present. By asking in detail what one wants from
the results, one can make an intelligent choice between them for any particular experiment on any
particular system.
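The statement that sequencing measures RNA levels "by recording how frequently each sequence appears in the pooled results" can be made concrete with a small sketch. It assumes that each read has already been assigned to a transcript (for instance by alignment to a reference genome) and simply converts the tallies to counts per million reads. The transcript names and read counts are hypothetical.

```python
from collections import Counter

def counts_per_million(read_assignments):
    """Convert a list of per-read transcript labels into counts per million reads."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    return {transcript: n * 1_000_000 / total for transcript, n in counts.items()}

# Hypothetical toy data: ten reads assigned to three transcripts.
reads = ["tx1"] * 6 + ["tx2"] * 3 + ["tx3"]
cpm = counts_per_million(reads)

for transcript in sorted(cpm):
    print(transcript, cpm[transcript])
```

This is only the counting step. Comparisons between transcripts within one sample usually also correct for transcript length, because a longer transcript yields more fragments at the same molar concentration.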
Exercises, problems, and weblems 293
● RECOMMENDED READING
Exercises
Exercise 9.1 In the third level (under Hybridization, washing), Figure 9.1 shows, schematically, a
row of 14 probe oligomers, corresponding to the leftmost 14 elements in one of the rows of the
19 × 19 square array of fluorescent dots in the fourth level. Which row?
Exercise 9.2 A professor of molecular biology wanted to design a microarray experiment to
detect tRNA sequences. He suggested using the sequences of the anticodon stem–loop as target
oligonucleotides (see Figure 1.4). A student pointed out that this was not likely to be successful.
For what reason?
Exercise 9.3 On a photocopy of Figure 9.2, (a) indicate the position of a gene that is highly
downregulated in one of the vehicle samples but not in the other two; (b) indicate the position of
a gene that is more highly upregulated in one of the dexamethasone samples than in the other two.
Exercise 9.4 On a photocopy of Figure 9.4, indicate the position of a gene that is first upregulated
and subsequently downregulated during the yeast diauxic shift.
Exercise 9.5 (a) Citrate synthase and (b) pyruvate decarboxylase are two of the enzymes that
change expression level in the diauxic shift in yeast. On a photocopy of Figure 9.3, mark the
approximate positions of the reactions that they catalyse.
Exercise 9.6 Transcriptome profiling is the measurement of patterns of mRNA concentrations.
However, 90% of the RNA in a cell is ribosomal. How could you apply the fact that mRNAs carry
a 3′ poly A tail to avoid interference from the high background levels of ribosomal RNA?
Problems
Problem 9.1 The average ratio of gene expression levels between human and chimpanzee is
∼1.5 for collinear chromosomes and can be as high as ∼1.6–1.7 for rearranged chromosomes
(Figure 9.12). (a) Qualitatively describe the differences in banding pattern between human and
chimpanzee for human chromosomes 5, 6, 9, and 10 (see Figure 4.18). (b) Are your results
consistent with the data in Figure 9.12? (c) Which would you expect to have an average ratio of
expression levels closer to 1.0, genes on human chromosome 3 and their gorilla homologues, or
genes on human chromosome 8 and their gorilla homologues?
Problem 9.2 Describe the reasons for and against including antibiotics routinely in animal feed.
Decide on a conclusion to be drawn from these arguments and formulate a paragraph of
recommendations.
Problem 9.3 A recent study of 1737 patients treated with vancomycin for S. aureus infections
concluded that many patients received less than an adequate dose to maintain serum
concentrations above the MIC.* For patients infected by vancomycin-sensitive S. aureus, this
characterized 7.9% of patients given continuous infusions and 19% of patients to whom the
drug was administered by periodic intravenous injections. For patients infected by S. aureus strains
with intermediate-level resistance to vancomycin, this characterized 79.1% of patients given
continuous infusions and 87.8% of patients given periodic intravenous injections. What are the
expected effects of this situation on (a) the individual patients involved, and (b) the spread of
vancomycin-resistant S. aureus strains?
Problem 9.4 You have a sample of mRNA which you convert to cDNA. From this material, what
information can you derive from a microarray that would not be available from a high-throughput
sequencing run with a Roche 454 Life Sciences Genome Sequencer?
Weblems
Weblem 9.1 An EST for rose germacrene D synthase appears in dbEST (GenBank) as entry
BQ105086. Blast this sequence against general sequence databases. What properties of rose
germacrene synthase can you infer from the results?
Weblem 9.2 (a) What is the chromosome location of the ASPM gene? (b) What mutations are
known? (c) What is their clinical effect?
Weblem 9.3 Search for articles on discovery of novel antibiotics and study a number of them
to learn about the current situation. (Suggested search terms: future antibiotic development.
Alternatively, consult articles cited as recommended reading in this chapter.) (a) Provide facts and
figures that describe trends in funding for research by large pharmaceutical companies devoted to
novel antibiotic development. (b) Assuming that these figures confirm that large pharmaceutical
companies are curtailing the relevant research programmes, what are the reasons for this, in the
face of high demand for novel antibiotics? (c) Can you recommend appropriate actions by
governments, industry, and the academic community? Give credible reasons that justify the
conclusion that if your actions were adopted they would accelerate the development and approval
of novel antibiotics. What do you see as the most serious obstacles to acceptance of your ideas?
How would you suggest overcoming them? (Warning: consider the possibility that the less
ambitious your suggestions, the more credibly you can justify them.)
* Kitzis, M.D. & Goldstein, F.W. (2006). Monitoring of vancomycin serum levels for the treatment of
staphylococcal infections. Clin. Microbiol. Infect. 12, 92–95.
CHAPTER 10
Proteomics
LEARNING GOALS
• To understand the fundamental chemical structure of proteins: the mainchain and sidechains,
types of sidechain, and common post-translational modifications.
• To understand the basic description of protein conformation.
• To be able to distinguish between primary, secondary, tertiary, and quaternary structures.
• To understand the use of polyacrylamide gel electrophoresis (PAGE) to separate proteins.
• To understand the technique and uses of mass spectrometry.
• To appreciate the principles of classifications of protein-folding patterns.
• To understand the possibilities and difficulties of protein structure prediction.
• To understand the goals of structural genomics projects.
298 10 Proteomics
Introduction
The proteome is the complete set of proteins associated with a sample of living matter. Proteomics
deals with the proteins that form the structures of living things, are active in living things, or
are produced by living things. This includes their nature, distribution, activities, interactions,
and evolution. Many fields contribute to proteomics.
• Chemistry and biochemistry. These include physical methods, such as spectroscopy, kinetics, and
techniques of structure determination, and organic and biochemical methods for working out
mechanisms of enzymatic catalysis. Techniques for separation and analysis of proteins have their
sources in chemistry and molecular biology.
• Molecular and cellular biology. These disciplines help to coordinate our knowledge of individual
proteins into an understanding of the biological context, and of how protein activities are
integrated.
• Evolutionary biology. Proteins evolve. Evolution explores variations in amino acid sequences,
protein structures, interactions and functions, and patterns of protein expression.
• Structural genomics. This activity applies advances in X-ray crystallography and nuclear magnetic
resonance (NMR) to high-throughput delivery of coordinate sets of proteins.
• Bioinformatics. This brings together the many data streams of genomics, expression patterns, and
proteomics, to assemble databases and create links among them. This enables their coordinated
application to problems of biology, clinical medicine, agriculture, and technology.
Bioinformatics co-ordinates its efforts with structural genomics to guide and supplement
experimental structure determinations by prediction of protein structures from amino acid
sequences. Prediction of protein structure is an important technique, given the great disparity
between the very large number of experimentally determined sequences and the relatively few
structures. Methods for prediction of protein structure from sequence, most of which take advantage
of the known protein structures, can provide libraries of three-dimensional models of proteins
encoded in genomes.
Proteins are where the action is.
• Proteins have a great variety of functions. There are structural proteins (molecules of the
cytoskeleton, epidermal keratin, viral coat proteins); catalytic proteins (enzymes); transport and
storage proteins (haemoglobin, retinol-binding protein, ferritin); regulatory proteins (including
hormones, many kinases and phosphatases, and proteins that control gene expression); and proteins
of the immune system and the immunoglobulin superfamily (including antibodies and proteins involved
in cell–cell recognition and signalling). How can proteins accomplish so many different things? By
coming in a great variety of structures, specialized to carry out different functions.
• The amino acid sequences of proteins dictate their three-dimensional structures and their folding
pathways. Under physiological conditions of solvent and temperature, proteins fold spontaneously to
an active native state. The amino acid sequence of a protein must not only preferentially stabilize
the native state; it must also contain a ‘road map’ telling the protein how to get there, starting
from the many diverse conformations that comprise the unfolded state. This is called the folding
pathway.
• Advances in protein science have spawned the biotechnology industry. It is now possible to design
and test modifications of known proteins and to design novel ones with desired functions.
Protein structure
The chemical structure of proteins
Chemically, protein molecules are long polymers typically containing several thousand atoms,
composed of a uniform repetitive backbone (or mainchain) with a particular sidechain attached to
each residue (see Figure 10.1). The amino acid sequence of a protein specifies the order of the
sidechains.
Hydrogen bonds are about 20 times weaker than covalent chemical bonds. In aqueous solution, the
solvent water can form hydrogen bonds to polar groups in nucleic acids and proteins. There is a
competition between intramolecular hydrogen bonds and solute–solvent hydrogen bonds. Hydrogen bonds
in solution can be easily broken and reformed.
• Hydrophobic interactions. Hydrophobic residues have sidechains that are primarily hydrocarbon in
nature. They have thermodynamically unfavourable interactions with water. Salad dressing is an
everyday example of the hydrophobic effect: the thermodynamic unfavourability of dissolving oil in
water causes a phase separation. It is energetically favourable to bury hydrophobic sidechains in
the interior of a protein, where they are not exposed to the solvent. This is a general feature of
the structures of globular proteins.
• Disulphide bridges. In addition to the primary chemical bonds in the individual residues and the
peptide bonds joining the residues into a polymer, cysteine residues in proteins, with sidechain
–CH2SH, can form disulphide bonds: –CH2S–SCH2–. Disulphide bonds contribute to the stability of
native states. In order to denature proteins fully, it is necessary to break any disulphide bonds.

Different possible conformations of the backbone of a protein bring different types of residue into
spatial proximity and expose some but not all residues to the solvent. Every conformation therefore
has a different associated energy that depends on the distribution of favourable and unfavourable
interactions. The native state of a soluble globular protein is the conformation that optimizes the
set of interactions among the residues and between the residues and the solvent.

• Different types of residues make different types of interactions, including hydrogen bonds,
hydrophobic interactions, and disulphide bridges. Formation of the native structure allows optimal
formation of favourable inter-residue and residue–solvent interactions.

Helices and sheets
Underlying the great variety of protein folding patterns are some recurrent structural themes.
Helices and sheets are two conformations of the polypeptide chain that appear in many proteins.
They satisfy the hydrogen bonding potential of the mainchain N–H and C=O groups, while keeping the
mainchain in an unstrained conformation. They thereby solve certain structural problems faced by
all globular proteins. They present general solutions – where in this context ‘general’ means
‘compatible with all (or at least almost all) amino acid sequences’.
Helices and sheets are like Lego® pieces, standard units of structure of which many proteins are
built and which can be put together in different ways. Helices are formed from a single consecutive
set of residues in the amino acid sequence. They are therefore a local structure of the polypeptide
chain, i.e. they form from a set of residues consecutive in the sequence. The mainchain
hydrogen-bonding pattern of an α-helix, the most common type of helix, links the C=O group of
residue i to the H–N group of residue i + 4.
Sheets form by lateral interactions of several independent sets of residues to create a
hydrogen-bonded network that is often nearly flat, but sometimes cylindrical (forming a barrel
structure). Unlike helices, sheets need not form from consecutive regions of the chain but may
bring together sections of the chain separated widely in the sequence.

• Helices and sheets are recurrent structures, stabilized by mainchain hydrogen bonding, that appear
in many protein structures.

Conformation of the polypeptide chain
The conformation of a polypeptide chain can be described in terms of angles of internal rotation
around the bonds in the mainchain (see Figure 10.2). The bonds between the N and Cα, and between the
Cα and C, are single bonds. Internal rotation around these bonds is not restricted by the electronic
structure of the bond, only by possible steric collisions in the conformations produced (see
Box 10.2).
The entire conformation of the protein can be described by these angles of internal rotation. Each
set of four successive atoms in the mainchain defines an angle. In each residue i (except for the N
and C termini), the angle φi is the angle defined by atoms C(of residue i − 1)–N–Cα–C, and the angle
ψi is the angle defined by atoms N–Cα–C–N(of residue i + 1). Then ωi is the angle around the peptide
bond itself, defined by the atoms Cα–C–N(of residue i + 1)–Cα(of residue i + 1).
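The rule that each set of four successive mainchain atoms defines an angle can be sketched in code. The function below computes the signed dihedral angle for four points using a standard vector formula (0° for a cis arrangement, 180° for trans). The helper names and the toy coordinates are assumptions made for illustration; they do not come from any protein structure.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed angle of internal rotation (degrees) about the bond p1-p2."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane of p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def backbone_angles(c_prev, n, ca, c, n_next, ca_next):
    """Return (phi, psi, omega) for one residue from six mainchain atom positions."""
    phi = dihedral(c_prev, n, ca, c)
    psi = dihedral(n, ca, c, n_next)
    omega = dihedral(ca, c, n_next, ca_next)
    return phi, psi, omega

# Toy geometry (not real atoms): the same three points with the fourth moved.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Passing the atoms in the orders given in the text (C of residue i − 1, N, Cα, C, and so on) to `backbone_angles` yields φ, ψ, and ω for that residue.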
The peptide bond has a partial double-bond character and adopts two possible conformations: trans
(by far the more common) and cis (rare). Angle ω is restricted to be close to 180° (trans) or
0° (cis).
Proline is an exception: the sidechain is linked back to the N of the mainchain to form a
pyrrolidine ring. This restricts the mainchain conformation of proline residues. It disqualifies
the N atom as a hydrogen-bond donor, for instance in helices or sheets. Also, the energy difference
between cis and trans conformations is less for proline residues than for others. Most cis peptides
in proteins appear before prolines.

[Figure 10.2 diagram. Atom labels: Cα(i−1), C(i−1), O(i−1), N(i), H(i), Cα(i), Cβ(i), C(i), O(i),
N(i+1), H(i+1), Cα(i+1); marked angles: φi, ψi, ωi]
Figure 10.2 Conformational angles describing the folding of the polypeptide chain.
The mainchain of each residue (except the C-terminal residue) contains three chemical bonds: N–Cα,
Cα–C, and the peptide bond C–N linking the residue to its successor. The conformation of the
mainchain is described by the angles of rotation around these three bonds:
Rotation around:   N–Cα bond   Cα–C bond   Peptide bond (C–N)
Name of angle:     φ           ψ           ω
ω is restricted to be close to 180° (trans) or, infrequently, close to 0° (cis).

The mainchain conformation of each residue is determined primarily by the two angles φ and ψ,
assuming the common trans conformation of the peptide bond, ω = 180°.
For some combinations of φ and ψ, atoms would collide, a physical impossibility. V. Sasisekharan,
C. Ramakrishnan, and G.N. Ramachandran first plotted the sterically allowed regions (see
Figure 10.3). There are two main allowed regions, one around φ = −57°, ψ = −47° (denoted αR) and the
other around φ = −125°, ψ = +125° (denoted β) with a ‘neck’ between them. The mirror image of the αR
conformation, denoted αL, is allowed for glycine residues only. (As glycine is achiral – identical
to its mirror image – a Ramachandran plot specialized to glycine must be right–left symmetric. For
non-glycine residues, collisions of the Cβ atom forbid the αL conformation.)
The two major allowed conformations of the mainchain, αR and β, correspond to the two major types
of secondary structure: α-helix and β-sheet. The α-helix is right-handed, like the threads of an
ordinary bolt. In the β region, the chain is nearly fully extended.
A graph showing the φ and ψ angles for the residues of a protein against the background of the
allowed regions is called a Sasisekharan–Ramakrishnan–Ramachandran plot, often called a
Ramachandran plot for short.
It is no coincidence that the same conformations that correspond to low-energy states of individual
residues also permit the formation of structures with extensive mainchain hydrogen bonding. The two
effects thereby cooperate to lower the energy of the native state.

Protein folding patterns
Focusing on the backbone, in the native state the polypeptide chain follows a curve in space. The
general spatial layout of this curve defines a folding pattern. We now know over 70 000 protein
structures. There is great but not infinite variety: many proteins have similar folding patterns.
The native states are selected from a large but finite repertoire.
In describing protein structures, the Danish protein chemist K.U. Linderstrøm-Lang described a
hierarchy of levels of protein structure. The amino acid sequence – roughly the set of chemical
bonds – is
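As a rough illustration of how a residue's (φ, ψ) pair relates to the named regions of the Ramachandran plot, the sketch below assigns a conformation to the nearest of three region centres: αR and β at the values quoted in the text, and αL taken as the mirror image of αR (both angles negated). Treating each region as a circle around its centre, and the 45° cutoff radius, are simplifying assumptions; the real allowed regions are irregular areas, not circles.

```python
import math

# Region centres in degrees: (phi, psi). alpha_L is the mirror image of alpha_R.
CENTRES = {
    "alpha_R": (-57.0, -47.0),
    "beta": (-125.0, 125.0),
    "alpha_L": (57.0, 47.0),
}

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (handles wraparound)."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def nearest_region(phi, psi, cutoff=45.0):
    """Name the closest region centre, or 'other' if none lies within `cutoff` degrees."""
    best_name, best_dist = "other", float("inf")
    for name, (c_phi, c_psi) in CENTRES.items():
        dist = math.hypot(angular_difference(phi, c_phi),
                          angular_difference(psi, c_psi))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= cutoff else "other"

print(nearest_region(-60.0, -45.0))   # close to the alpha_R centre
print(nearest_region(-120.0, 130.0))  # close to the beta centre
print(nearest_region(60.0, 45.0))     # close to the alpha_L centre
```

A proper check of steric allowedness would test a point against the plotted region boundaries rather than a distance to a centre.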
called the primary structure. The assignment of helices and sheets – the hydrogen-bonding pattern of
the mainchain – is called the secondary structure. The assembly and interactions of the helices and
sheets is called the tertiary structure. For proteins composed of more than one subunit, J.D. Bernal
called the assembly of the monomers the quaternary structure (see Figure 10.4 and Box 10.3).
Some proteins change their quaternary structure as part of a regulatory process. Cyclic AMP
activates protein kinase A by a mechanism involving subunit dissociation. The resting, inactive form
of protein kinase A is a tetramer of two catalytic subunits and two regulatory subunits. In this
resting state, the regulatory subunits inhibit the activity of the catalytic subunits. Binding of
cyclic AMP to protein kinase A dissociates the tetramer, releasing individual catalytic subunits in
active form.
In some cases, evolution can merge proteins – changing quaternary to tertiary structure. For
example, five separate enzymes in Escherichia coli that catalyse successive steps in the pathway of
biosynthesis of aromatic amino acids correspond to five regions of a single protein in the fungus
Aspergillus nidulans.

• We describe protein folding patterns according to a hierarchy of primary, secondary, tertiary, and
quaternary structures. See Box 10.3 and Figure 10.4.

[Figure 10.3 plot. Horizontal axis: φ, −180° to 180°; vertical axis: ψ, −180° to 180°. Labelled
regions: αR, αL; labelled residues: 21F, 82N, G (glycines)]
Figure 10.3 A Sasisekharan–Ramakrishnan–Ramachandran plot of bovine acylphosphatase [2ACY].
Sterically most-favourable regions are shown in green and sterically allowed regions in yellow.
Residues with φ > 0, mostly glycines, appear in red.

Domains
One way that proteins have evolved increasing complexity is by assembling a large protein from a set
of smaller quasi-independent subunits, either by forming stable oligomers, as in haemoglobin (see
Figure 10.4), or by concatenating units within a single polypeptide chain. Domains are compact units
within the folding pattern of a single chain. Justifications for regarding them as quasi-independent
include the observation that domains can be ‘mixed and matched’ in different proteins and, in many
cases, the similarities of their
folding patterns to those of homologous monomeric proteins.
Domains form the basis of the higher-level protein structural organization typical of eukaryotic
proteins. Modular proteins are multidomain proteins that often contain many copies of closely
related domains. For example, fibronectin, a large extracellular protein involved in cell adhesion
and migration, contains 29 domains including multiple tandem repeats of three types of domain, F1,
F2, and F3 (see Figure 4.16). It is a linear array of the form: (F1)₆(F2)₂(F1)₃(F3)₁₅(F1)₃.
Fibronectin domains also appear in other modular proteins. (See
http://www.bork.embl-heidelberg.de/Modules/ for pictures and nomenclature.)
To create new proteins, inventing new domains is an unusual event. It is far more common to create
different combinations of existing domains in increasingly complex ways. These processes can occur
independently, and take different courses, in different phyla.

Figure 10.4 Underlying the great variety of protein folding patterns are a number of common
structural features. For instance, α-helices and β-sheets are standard elements of the ‘parts list’
of many protein structures. α-helices and β-sheets were modelled by L. Pauling before their
experimental observation. Pauling recognized that helices and sheets provide convenient ways for the
residues to achieve comfortable steric relationships and satisfy the requirements for backbone
hydrogen bonding in an (almost) sequence-independent manner.
This figure shows, at the upper left, the primary structure in terms of a simple extended chain.
The standard secondary structures, the α-helix and β-sheet, are shown at the upper right, with
hydrogen bonds indicated by broken lines. Tertiary structure is represented, at the lower left, by
acylphosphatase, which contains two α-helices packed against a five-stranded β-sheet. Human
haemoglobin, a tetramer containing two copies of two types of chain, illustrates quaternary
structure, at the lower right. (Acylphosphatase is not a subunit of haemoglobin.)
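The compact notation for the fibronectin domain layout can be expanded mechanically, which also confirms that the repeat counts add up to the 29 domains stated in the text. The list-of-pairs representation below is just one convenient encoding chosen for this sketch.

```python
from collections import Counter

# Fibronectin layout as (domain type, number of tandem repeats), in chain order.
fibronectin_layout = [("F1", 6), ("F2", 2), ("F1", 3), ("F3", 15), ("F1", 3)]

def expand_layout(layout):
    """Expand (domain, repeat) pairs into the full linear list of domains."""
    return [domain for domain, repeats in layout for _ in range(repeats)]

domains = expand_layout(fibronectin_layout)
per_type = Counter(domains)

print(len(domains))      # total number of domains in the chain
print(dict(per_type))    # how many domains of each type
```

The same encoding makes it easy to compare domain architectures of different modular proteins, for example by counting shared domain types.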
Post-translational modifications
How much does genomics actually tell us about the proteome? Even if we could identify coding regions of genomes with complete accuracy, we would not know about:
• levels of transcription – or even absence of transcription;
• formation of different splice variants (in eukaryotes);
• mRNA editing – exclusive of splicing – before translation, which alters the amino acid sequence;
• the nature and binding sites of ligands integral to the final structure; and
• post-translational modifications, the subject of this section.
The ribosome synthesizes proteins by using the genetic code to direct the incorporation of a sequence of amino acids chosen from the canonical 20. Selenocysteine and pyrrolysine are two rare natural extensions of the standard genetic code.
However, the protein world is richer than the standard genetic code suggests. Many proteins contain ligands, such as metal ions or small organic molecules, as intrinsic and permanent parts of the structures. The nature of the binding between protein and ligand depends on the protein as well as the ligand. For instance, the haem group is bound covalently to cytochrome c but non-covalently to (almost all) globins. (Of course, proteins bind many molecules transiently. Enzyme–substrate complexes provide many examples.)
Post-translational modifications can take several forms (see also Box 10.4).
• Attaching various groups to sidechains, including but not limited to acetate, phosphate, lipids, and carbohydrates. Disulphide bridge formation is a related example. Some additions, such as sulphation, are permanent modifications; others, notably phosphorylations, are in many cases reversible.
• Conversions, for instance deamidation of asparagine (or glutamine) to aspartic acid (glutamic acid), or deimination of arginine to citrulline.
• Removing peptides, either from a terminus or from the middle of the chain, and in a few cases even making cyclic permutations.
• Addition of other peptides or proteins, not always by extension of the mainchain, through peptide linkages.
– Facilitation of folding. Insulin contains two polypeptide chains, one of 21 residues and the other of 30 residues. The precursor proinsulin is a single 81-residue polypeptide chain from which excision of an internal peptide produces the mature protein. Insulin contains one intrachain and two interchain disulphide bridges. Attempts to renature mature insulin – after unfolding and breaking the disulphide bridges – give poor yields. Many incorrectly paired disulphide bridges form. In vivo, the precursor proinsulin folds into a three-dimensional structure with the cysteines in proper relative positions to form the correct disulphide bridges. Excision of a central region by endopeptidases then produces the mature dimer. Unfolded proinsulin can spontaneously refold correctly.
• Most post-translational cleavage reactions are carried out by proteases. Alternatively, inteins are proteins that have a ‘self-splicing’ activity. They autocatalytically excise internal peptides and join the ends. (In contrast, peptide excision from proinsulin leaves two chains that are not joined by a peptide bond.)
• The lectin concanavalin A is synthesized in a precursor form that is a cyclic permutation of the final structure. Thus, during maturation of the protein, there is cleavage of an internal peptide bond and formation of a new peptide bond between the original N and C termini. For concanavalin A, the DNA sequence of the gene is not co-linear with the amino acid sequence of the mature protein.
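The idea of a cyclic permutation, as described for concanavalin A, can be made concrete with a small sketch. The sequences below are hypothetical toy strings, not the real concanavalin A sequence, and the model assumes that no residues are lost during maturation, which is a simplification.

```python
def is_cyclic_permutation(precursor, mature):
    """True if `mature` is `precursor` cut at one position with the two
    pieces rejoined in the opposite order."""
    return len(precursor) == len(mature) and mature in precursor + precursor

def cut_position(precursor, mature):
    """Index in `precursor` where the mature chain begins, or None if the
    two sequences are not cyclic permutations of each other."""
    if not is_cyclic_permutation(precursor, mature):
        return None
    return (precursor + precursor).find(mature)

# Hypothetical toy sequences for illustration only.
precursor = "GHKLMNACDEF"
mature = "ACDEFGHKLMN"

print(is_cyclic_permutation(precursor, mature))
print(cut_position(precursor, mature))
```

Doubling the precursor string is a common trick: every rotation of a sequence appears as a substring of the sequence concatenated with itself.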
Why is there a common genetic code with 20 canonical amino acids?
Almost all organisms synthesize proteins containing a canonical set of 20 amino acids.
However, both nature and the laboratory show that 20 amino acids are not a fundamental limitation. Selenocysteine and pyrrolysine are natural exceptions. P. Schultz and co-workers have extended the genetic code by introducing modified tRNAs and synthetases into E. coli, yeast, and even mammalian cells in tissue culture.* Approximately 70 novel amino acids are now available to be introduced into proteins at specific sites. Some of the novel amino acids show designed steric or electronic properties; others contain chromophores as fluorescent reporters or are susceptible to photocross-linking; there are glycosylated amino acids; iodine derivatives to facilitate X-ray structure determination; and sidechains containing other types of reactive group.
In understanding the contents and layout of the common genetic code, can we go beyond F. Crick’s comment that the code is a ‘frozen accident’?
There is now consensus that prokaryotes were and are engaged in widespread horizontal gene transfer (see p. 120). This suggests – leaving aside the question of what the optimal genetic code should be – that it would be to the advantage of any participating species to conform to some standard, as that would give it access to all of the other genes. Analogously, anyone can run any operating system on a computer that they want, but the obvious advantages of running the same system as many other people exert pressure to conform to some standard. Perhaps that is at least a partial explanation of why almost all species have the same genetic code.
It is true that if different species adopted different genetic codes, this might protect them against viruses jumping from other species.
But why not a code with many more than 20 amino acids? Certainly, one perfectly feasible way to introduce greater versatility into the components of proteins is by expanding the genetic code. However, keeping within the general framework of a triplet code, introducing more amino acids at the expense of the redundancy of the code threatens to reduce robustness. An alternative approach to greater versatility without this cost is to effect post-translational modifications of individual amino acids. Whether or not this reasoning is the correct explanation, post-translational modification is the choice that nature seems largely to have made.
* See http://schultz.scripps.edu/research.html and Wang, L. & Schultz, P.G. (2004). Expanding the genetic code. Angew. Chem. Int. Ed. Engl. 44, 34–66.
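The trade-off between the number of amino acids and the redundancy of a triplet code can be put in numbers. The sketch below simply enumerates the triplets of a four-letter alphabet and removes the three stop codons of the standard code; the variable names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}

codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [c for c in codons if c not in STOP_CODONS]

n_amino_acids = 20
mean_codons_per_amino_acid = len(sense_codons) / n_amino_acids

print(len(codons))                           # all possible triplets
print(len(sense_codons))                     # triplets that encode an amino acid
print(round(mean_codons_per_amino_acid, 2))  # average redundancy
```

With 61 sense codons shared among 20 amino acids, each amino acid has about three codons on average. Assigning codons to additional amino acids would reduce that average, which is the loss of redundancy the text refers to.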
Separation and analysis of proteins
The complete complement of a cell’s proteins is a large and complex set of molecules. Metazoa contain tens of thousands of protein-encoding genes. Different splice variants multiply the number of possible proteins. Vertebrate immune systems generate billions of molecules by specialized techniques of combinatorial gene assembly.
To give some idea of the ‘dynamic range’ required of detection techniques, the protein inventory of a yeast cell varies from 1 copy per cell to 1 million copies per cell.
Examples of techniques for separating mixtures of proteins include gel filtration, chromatography, and electrophoresis. All methods of separating molecules require two things:
1. A difference in some physical property, between the molecules to be separated; and
2. a mechanism, taking advantage of that property, to set the molecules in motion; the speed differing according to the value of the property selected. This moves apart molecules with different properties.
In some separation methods, one component can stand still and the other(s) move away from it. Affinity chromatography is an example. With others, different species can all move, at different rates, and spread themselves out.
• To measure an inventory of the proteins in a sample, the proteins must be: (1) separated, (2) identified, (3) counted.
Polyacrylamide gel electrophoresis (PAGE)
In electrophoresis, an electric field exerts force on a molecule. The force is proportional to the molecule’s total or net charge. In a vacuum, the corresponding acceleration would be inversely proportional to the mass. However, counteracting the acceleration from the electric field are retarding forces from the medium through which the proteins move. Polyacrylamide gels contain networks of tunnels, with a distribution of sizes. Smaller proteins can enter smaller tunnels as well as larger ones, and therefore move faster through the gel than larger proteins. Proteins with different mobilities move different distances during a run, spreading them out on the gel.
The mobility of a native protein depends on its mass and its shape. Higher mass tends to reduce mobility; more compact shape tends to increase it. In particular the mobility of denatured proteins is lower than that of the corresponding native states. To achieve a separation that depends solely on molecular weight, denature the proteins. Common denaturing media include urea (which competes for hydrogen bonds), and the reducing agent dithiothreitol to break S–S bridges (and iodoacetamide to prevent their reformation).
Sodium dodecyl sulphate (SDS) is a negatively charged detergent that helps to denature proteins. Multiple detergent molecules bind all along the polypeptide chain. The result is a protein–detergent complex that has an extended shape, with a uniform charge density along its length.
Carrying out SDS-PAGE in one dimension spreads out a mixture of proteins or nucleic acids into bands. Running several samples on the same gel in parallel lanes is a familiar procedure if only from sequencing gels. The results of protein gels can be made visible (‘developed’) by staining with Coomassie Blue, or, if the samples are radioactively labelled, by autoradiography. Often markers of known molecular weight are run in a separate lane for calibration.
Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE)
One-dimensional PAGE will not adequately separate a very complicated mixture of proteins. The bands in a lane on a gel will overlap, and contain mixtures of proteins with similar sizes. To achieve better resolution, a two-stage procedure first separates proteins according to charge; then an SDS-PAGE step, run in a direction 90° from the original direction, separates according to size.
The charge on a protein depends on the charged residues it contains, and the pH of the medium. At different values of pH, ionizable groups on proteins have different charges. For instance, a free histidine
[Figure 10.5 panels: Stage 1, Stage 4, Stage 6. Horizontal axis: pI, from 4 to 8. Vertical axis: MW markers at 97.4, 66.2, 45.0, 31.0, 21.5, and 14.4.]
Figure 10.5 Two-dimensional PAGE gels of rose petal proteins at developmental stages 1, 4, and 6. Each gel contains over 600 proteins,
of which 421 are common to all three stages. About 12% of the proteins are stage specific.
From: Dafny-Yelin, M., et al. (2005). Flower proteome: changes in protein spectrum during the advanced stages of rose petal development. Planta 222,
37–46.
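Marker lanes such as the molecular-weight ladder in Figure 10.5 are commonly used for calibration by fitting the logarithm of molecular weight against migration distance. The sketch below assumes that the marker values are in kDa and uses hypothetical migration distances, since none are given in the figure; it illustrates the general approach and is not the procedure used by the cited authors.

```python
import math

# Marker molecular weights from the ladder in Figure 10.5 (assumed to be kDa).
marker_mw = [97.4, 66.2, 45.0, 31.0, 21.5, 14.4]
# HYPOTHETICAL migration distances (cm); real values are measured on the gel.
marker_distance = [1.0, 2.1, 3.2, 4.3, 5.3, 6.5]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Fit log10(MW) as a linear function of migration distance.
slope, intercept = fit_line(marker_distance, [math.log10(m) for m in marker_mw])

def estimate_mw(distance_cm):
    """Estimate the molecular weight of a band from its migration distance."""
    return 10 ** (slope * distance_cm + intercept)

print(round(estimate_mw(3.8), 1))  # a band between the 45.0 and 31.0 markers
```

The negative slope reflects the statement in the text that smaller proteins move faster through the gel.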
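[Figure 10.8 spectrum: vertical axis is relative intensity (%), up to 100. Labelled peaks include 981.5262, 1157.6235, 1622.8511, and 1919.0556.]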
Figure 10.8 Mass spectrum of a tryptic digest. Of the 21 highest peaks (shown in black), 15 match expected tryptic peptides of the
39 kDa subunit of cow mitochondrial complex I. This easily suffices for a positive identification.
Figure courtesy of Dr I.M. Fearnley, MRC Dunn Human Nutrition Unit, Cambridge, UK.
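Matching observed peaks against expected tryptic peptides, as in Figure 10.8, can be sketched as an in-silico digest followed by a mass comparison. The cleavage rule used here (after Lys or Arg, but not before Pro) is the commonly quoted one; the protein sequence, the observed masses, the tolerance, and the function names are all hypothetical, and missed cleavages, modifications, and ion charge are ignored.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses

def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER

def count_matches(observed, expected, tolerance=0.02):
    """Number of observed masses within `tolerance` Da of an expected mass."""
    return sum(any(abs(o - e) <= tolerance for e in expected) for o in observed)

protein = "MNLLQVVRAGKPSTEK"   # HYPOTHETICAL sequence
observed = [971.56, 1500.00]   # HYPOTHETICAL neutral masses
peptides = tryptic_digest(protein)
expected = [peptide_mass(p) for p in peptides]
print(peptides, count_matches(observed, expected))
```

A real identification would compare the observed list against digests of every protein in a database and rank the candidates by the number of matches.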
peptide bonds yield a series of ions differing by the masses of single amino acids. The y ions are a set of nested fragments containing the C terminus (see Figure 10.9a) (b ions are nested fragments containing the N terminus). The difference in mass between successive y ions is the mass of a single residue. The amino acid sequence of the peptide is, therefore, deducible from analysis of the mass spectrum (see Figure 10.9b).
Two ambiguities remain: Leu and Ile have the same mass and cannot be distinguished, and Lys and Gln have almost the same mass and usually cannot be distinguished. Discrepancies from the masses of standard amino acids signal post-translational modifications. In practice, the sequence of about 5–10 amino acids can be determined from a peptide of length <20–30 residues.
Measuring deuterium exchange in proteins
If a protein is exposed to heavy water (D2O), mobile hydrogen atoms will exchange with deuterium at rates dependent on the protein conformation. By exposing proteins to D2O for variable amounts of time, mass spectrometry can give a conformational map of the protein. Applied to native proteins, the results give information about the structure. Applied to initially denatured proteins brought to renaturing conditions using pulses of exposure, the method can give information about intermediates in folding.
• Mass spectrometry is often used to characterize proteins isolated from mixtures. The peptide mass fingerprint – the list of fragment masses – is usually sufficient to identify a protein.
[Figure 10.9 (a): nested ion series, as extracted: M N L Q V V R COO; M N L Q V V COO; M N L Q V COO. (b): mass spectrum, relative intensity (%) against m/z from 100 to 1000, annotated with the C-terminal sequence R V V Q L L N M. Labelled peaks: a1 104.03, y1 175.09, b2 246.06, y2, y3 373.24, y4 501.29, y5 614.37, y6 727.45, y7 841.49, yMax.]
Figure 10.9 Peptide sequencing by mass spectrometry. Collision-induced dissociation produces a mixture of ions. (a) The mixture
contains a series of ions, differing by the masses of successive amino acids in the sequence. The ions are not produced in sequence
as suggested by this list, but the mass-spectral measurement automatically sorts them in order of their mass/charge ratio. (b) Mass
spectrum of fragments suitable for C-terminal sequence determination. The greater stability of y ions over b ions in fragments produced
from tryptic digests simplifies the interpretation of the spectrum. The mass differences between successive y-ion peaks are equal to the
individual residue masses of successive amino acids in the sequence. Because y ions contain the C terminus, the y-ion peak of smallest
mass contains the C-terminal residue etc., and therefore the sequence comes out ‘in reverse’. The two leucine residues in this sequence
could not be distinguished from isoleucine in this experiment.
From: Carroll, J., Fearnley, I.M., Shannon, R.J., Hirst, J., & Walker, J.E. (2003). Analysis of the subunit composition of complex I from bovine heart
mitochondria. Mol. Cell Proteomics 2, 117–126 (supplementary figure S138).
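The reading of a sequence from a y-ion ladder can be sketched as follows. The peak values are the y3 to y7 labels as they appear in the extracted Figure 10.9(b); the pairing of masses with labels is a reading of the figure, and the tolerance of 0.05 Da is an assumption chosen for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def residues_for(delta, tolerance=0.05):
    """All residues whose mass lies within `tolerance` Da of `delta`."""
    return sorted(r for r, m in RESIDUE_MASS.items() if abs(m - delta) <= tolerance)

# y3 .. y7 peak positions as read from Figure 10.9(b).
y_ions = [373.24, 501.29, 614.37, 727.45, 841.49]

# Each step between successive y ions adds one residue, moving away
# from the C terminus.
steps = [residues_for(b - a) for a, b in zip(y_ions, y_ions[1:])]

for label, candidates in zip(("y4", "y5", "y6", "y7"), steps):
    print(label, "/".join(candidates))
```

At this tolerance the first step is reported as K/Q and the next two as I/L, which reproduces the two ambiguities described in the text; the final step resolves to N.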
• CE: a database of structural alignments
http://cl.sdsc.edu/
These sites describe projects derived from the primary archival databases of macromolecular coordinates. They are useful general entry points to protein structural data.
• It is in the tertiary structure of domains that proteins show their individuality and variety. Classifying proteins according to their tertiary structure indicates evolutionary relationships (or, at the very least, interesting structural similarities) between proteins that might have diverged so far that the relationship is not detectable by comparing their amino acid sequences.
SCOP
SCOP, by A.G. Murzin, L. Lo Conte, B.G. Ailey, S.E. Brenner, T.J.P. Hubbard, and C. Chothia, organizes protein structures in a hierarchy according to evolutionary origin and structural similarity. At the lowest level of the hierarchy are individual domains. SCOP groups sets of domains into families of homologues, for which the similarities in structure and sequence (and sometimes function) imply a common evolutionary origin. Families, containing proteins of similar structure and function but for which the evidence for evolutionary relationship is suggestive but not compelling, form superfamilies. Superfamilies that share a common folding topology, for at least a large central portion of the structure, are grouped as folds. Finally, each fold group falls into one of the general classes. The major classes in SCOP are α, β, α + β, α/β, and miscellaneous ‘small proteins’, many of which have little secondary structure and have structures stabilized by disulphide bridges or ligands. Box 10.5 shows the SCOP classification of E. coli CheY (see Figure 10.10).
• SCOP (Structural Classification of Proteins) offers facilities for searching on keywords to identify structures, navigation up and down the hierarchy, generation of pictures, access to the annotation records in the PDB entries, and links to related databases.
The latest SCOP release contains 38 221 PDB entries split into 110 800 domains. The distribution of entries at different levels of the hierarchy is shown in Table 10.1.
To locate a protein of interest in SCOP, the user can traverse the structural hierarchy, or search via keywords, such as protein name, PDB code, function (including Enzyme Commission number), and name of fold (for instance, barrel). For each structure, SCOP provides textual information, pictures, and links to other databases.
Numerous other web sites offering classifications of protein structures are indexed at: http://www.bioscience.org/urllists/protdb.htm.
BOX 10.5 SCOP classification of CheY protein of Escherichia coli
1. Root: SCOP.
2. Class: Alpha and beta proteins (α/β). Mainly parallel β-sheets (β–α–β units).
3. Fold: Flavodoxin-like. Three layers, α/β/α; parallel β-sheet of five strands, order 21345.
4. Superfamily: CheY-like.
5. Family: CheY-related.
6. Protein: CheY protein.
7. Species: Escherichia coli.
Table 10.1 The distribution of SCOP entries at different levels of the hierarchy
Changes in folding patterns in protein evolution
Proteins identified by SCOP as related by evolution show recognizably similar but not identical folding patterns. Figure 10.11 compares spinach plastocyanin and cucumber stellacyanin. For illustrations of the degree of similarities of proteins grouped together at different levels of the hierarchy, and discussion of other classification schemes, see Chapter 4 of Introduction to Protein Architecture (Lesk, A.M. Oxford University Press, Oxford), which contains a large number of pictures of protein structures suitable for browsing by any reader interested in exploring the stunning variety of folding patterns seen in nature.
Figure 10.11 Two related proteins that share the same general folding pattern, but differ in detail. Circles represent copper ions.
(a) Spinach plastocyanin [1AG6], (b) cucumber stellacyanin [1JER]. Superposition showing (c) the entire structures and (d) only the
well-fitting core (plastocyanin, green; stellacyanin, magenta). The main secondary structural elements of these proteins are two β-sheets
packed face-to-face. It is seen in the superposition that several strands of β-sheet are conserved but displaced, and that the helix at the
right of the cucumber stellacyanin structure has no counterpart in the spinach plastocyanin structure. Even the (relatively) well-fitting
core shows the conservation of folding topology but nevertheless reveals considerable distortion.
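The SCOP hierarchy of Box 10.5 lends itself to a simple representation as an ordered path from class to species. The sketch below uses the CheY entry from the box; the second entry is entirely hypothetical and exists only to show how the deepest shared level between two classifications might be found. The function and variable names are illustrative and are not part of SCOP.

```python
# Levels of the hierarchy below the root, in the order used in Box 10.5.
LEVELS = ("Class", "Fold", "Superfamily", "Family", "Protein", "Species")

# Classification of CheY as given in Box 10.5.
chey = (
    "Alpha and beta proteins (a/b)",
    "Flavodoxin-like",
    "CheY-like",
    "CheY-related",
    "CheY protein",
    "Escherichia coli",
)

# HYPOTHETICAL entry sharing class and fold with CheY, for illustration only.
other = (
    "Alpha and beta proteins (a/b)",
    "Flavodoxin-like",
    "hypothetical superfamily",
    "hypothetical family",
    "hypothetical protein",
    "hypothetical species",
)

def deepest_shared_level(a, b):
    """Name of the lowest level at which two classifications still agree,
    or None if they already differ at the class level."""
    shared = None
    for level, name_a, name_b in zip(LEVELS, a, b):
        if name_a != name_b:
            break
        shared = level
    return shared

print(deepest_shared_level(chey, other))
```

Two domains that agree only down to the fold level share a folding topology without the evidence of common ancestry that membership of the same superfamily or family implies.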
The fundamental principle is that proteins fold to unique native structures. However, the mechanism of action of many proteins requires flexibility, or conformational change, during the active cycle. Most protein conformational changes are responses to binding.
• An enzyme may change structure after binding a substrate and/or a cofactor.
• Conformational changes arising from interactions with one or more other proteins, or nucleic acids, are a common component of regulatory mechanisms.
• Some proteins are microscopic motors, interconverting chemical and mechanical energy.
• Some serpins (serine protease inhibitors) are synthesized in a metastable active state and convert spontaneously to an inactive native state. This gives them a limited lifetime of activity, under tighter control than normal turnover processes. (Serpins also undergo an analogous structural change when they are cleaved by proteases, as part of their mechanism of inhibition.)
• Many proteins are microscopic machines, with internal parts moving in precise ways to support their function.
4. The EP complex breaks down to release product (P) and re-form the original enzyme.
E + S ⇌ ES ⇌ ES‡ → EP → E + P
Many enzymes have different structures in the unligated state, in the Michaelis complex, in the transition state, and/or in the enzyme–product complex.
Many binding sites occur in clefts between protein domains. Binding often induces conformational changes involving reorientation of domains, to close the structure around the ligand. Some of these changes can be described as a ‘hinge motion’, in which the two domains remain individually rigid but change their relative orientation by means of structural changes in only a few residues in the regions linking the domains. Hinge motion in myosin is responsible for the impulse in muscle contraction.
• By their nature, transition states are reactive and difficult to trap long enough for structure determination. Possible solutions include enzymes binding transition-state analogues or inhibitors, or lowering the temperature to slow down the reaction.
Figure 10.12 Superposition of two conformational states of horseshoe crab arginine kinase [1M15, 1M80]. The unligated state is shown
in pink and purple; the ligated state in dark green and cyan. The ligands, arginine and ADP, appear in the ligated structure only. There
are steric clashes between the ligands and the unligated structure in its position in the picture.
The nature of the conformational change has reminded many people of the Venus flytrap. Regions of the structure have come together
around the ligands. The motion of the small domain at the top of the picture is primarily a ‘hinge’ motion – the mobile domain moves
almost rigidly around an axis through the interdomain interface. The axis is approximately perpendicular to the page.
The parts of the structure at the lower right also deform. This protein is showing ‘induced fit’ – in response to ligation.
Figure 10.13 Superposition of two structures of ribose-binding protein [2DRI, 1URP]. The unligated structure is shown in pink and cyan;
the ligated structure is shown in dark green and purple. The ribose, in yellow, appears only in the ligated structure.
Compared with arginine kinase (Figure 10.12), this is a more pure ‘hinge’ motion: the individual domains remain nearly rigid.
The conformational change is achieved by rotations about bonds in only a few residues in the hinge region itself.
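A pure hinge motion of the kind described for Figures 10.12 and 10.13 is, geometrically, a rigid-body rotation of one domain about an axis. The sketch below applies Rodrigues' rotation formula to a few hypothetical atomic coordinates; the coordinates, the axis, and the 30° angle are invented for illustration and are not taken from the deposited structures.

```python
import math

def rotate_about_axis(point, axis_point, axis_dir, angle_deg):
    """Rotate `point` by `angle_deg` about the line through `axis_point`
    with direction `axis_dir`, using Rodrigues' rotation formula."""
    norm = math.sqrt(sum(c * c for c in axis_dir))
    k = [c / norm for c in axis_dir]
    v = [p - a for p, a in zip(point, axis_point)]
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1 - cos_t)
        for i in range(3)
    ]
    return tuple(r + a for r, a in zip(rotated, axis_point))

# HYPOTHETICAL coordinates (in Å) of three atoms in the mobile domain.
mobile_domain = [(5.0, 0.0, 0.0), (6.0, 1.0, 0.5), (7.0, -1.0, 1.0)]
hinge_point = (0.0, 0.0, 0.0)   # a point on the hinge axis (assumed)
hinge_axis = (0.0, 0.0, 1.0)    # direction of the hinge axis (assumed)
hinge_angle = 30.0              # rotation in degrees (assumed)

closed = [rotate_about_axis(p, hinge_point, hinge_axis, hinge_angle)
          for p in mobile_domain]

# Distances within the domain are unchanged: the domain moves as a rigid body.
before = math.dist(mobile_domain[0], mobile_domain[1])
after = math.dist(closed[0], closed[1])
print(round(before, 6), round(after, 6))
```

In a real analysis the axis and angle would be derived from superposing the two experimental structures, and deviations from rigidity within each domain would measure how far the motion departs from a pure hinge.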
Ribose-binding protein, like many other members of this family, undergoes conformational changes upon ligation such that domains of the protein close around the ligand. The structural changes increase the protein–ligand interactions. They also create a new surface recognized by transport complexes.
Ras has a GTPase activity to reset it to the inactive state. Mutations that abrogate the GTPase activity are oncogenic. Mutants that are trapped in the active state continuously trigger proliferation. Mutations in Ras appear in 30% of human tumours.
Figure 10.14 p21 Ras bound to GTP. Although an active GTPase, the system was stabilized for crystal-structure analysis by cooling to
100 K [1QRA].
Figure 10.15 The conformational change in p21 Ras from the inactive GDP-binding conformation to the active GTP-binding
conformation primarily involves two regions (shown here in red) that form a patch on the molecular surface [1QRA, 1Q21].
Many proteins change conformation as part of the mechanism of their function
Some motor proteins propel themselves – and their cargo – by exerting force against a stationary object, such as a cytoskeletal filament. Others remain stationary and propel movable objects.
• Myosins interact with actin during muscle contraction.
• Kinesins and dyneins interact with microtubules, mediating organelle transport, chromosome separation in mitosis, and movements of cilia and flagella.
Myosins, kinesins, and dyneins are primarily linear motors. In contrast, ATPase is a rotary motor.
• ATPase rotates during its action. Oxidative phosphorylation and photosynthesis create pH gradients across the membranes of mitochondria and chloroplasts, respectively. The mechanical step of ATPase activity is part of the mechanism for converting the osmotic energy of the potential gradient across a membrane to the high-energy phosphate bond of ATP.
• Motor proteins are energy transducers. Some involve conversion of chemical energy – via ATP hydrolysis – to mechanical energy. ATP synthase converts chemiosmotic energy to chemical energy by ATP formation, with a rotary motor as part of its mechanism.
The sliding filament mechanism of muscle contraction
The structural and mechanical unit of vertebrate skeletal muscle is an intracellular organelle called the sarcomere. Sarcomeres contain interdigitating filaments of actin and myosin (Figure 10.16). The actin filaments are fixed to structures called the Z-disks at the ends of the sarcomere. The motor protein myosin pulls the actin filaments inwards towards the centre of the sarcomere. During contraction, the actin and myosin filaments do not themselves shorten but slide past one another, shortening the sarcomere by increasing the region of overlap. Think of the shortening of a bicycle pump during its compression stroke.
A large muscle may contain ∼10⁴–10⁵ sarcomeres, laid end to end. Each sarcomere has a resting length of ∼2.5 μm and can contract by ∼0.3 μm. Therefore, the entire muscle can contract by ∼1–2 cm.
Individual myosin molecules are large fibrous proteins of relative molecular mass ∼5 × 10⁵. They contain a fibrous section ∼1.6–1.7 μm long, and a globular head. Each thick filament contains ∼200–300 myosin molecules. The mechanical coupling between actin and myosin occurs through the myosin head, as shown in Figure 10.16.
During the power stroke, the myosin head undergoes a cycle of attachment–detachment and conformational change. From left to right in the inset in Figure 10.16, attachment of the myosin head is followed by conformational change that propels the actin towards the centre of the sarcomere. Detachment is followed by restoration of the initial conformation of the myosin. The myosin heads are like oars that ‘row’ the actin filaments towards the centre of the sarcomere. The displacement of the actin is ∼10 nm per myosin molecule per cycle. Hydrolysis of one molecule of ATP during each cycle of each myosin molecule provides the energy.
Structures of fragments of myosin containing the globular head have defined the mechanism of the conformational change (see Figure 10.17).
Figure 10.16 Schematic diagram of a sarcomere, with actin and myosin labelled. Thick myosin filaments (red) overlap thin actin filaments (black). In the main diagram, it is cursorily indicated that multiple myosin molecules from thick filaments interact with adjacent thin filaments. In fact, each thick filament contains several hundred myosin molecules. The inset shows different stages of the power stroke. From left to right: attachment, conformational change propelling the thin filament inwards by ∼10 nm, detachment (followed by recovery of original conformation of the myosin head).
Allosteric proteins show ‘action at a distance’: ligand binding at one site affects activity at another. An impulse at the first site must transmit a conformational change affecting the second. In contrast, GTP-activated p21 Ras (Figures 10.14 and 10.15) shows ligand-induced regulation of activity, but the structural change is adjacent to the ligand. It is more challenging to explain the properties of haemoglobin, in which the shortest distance between binding sites is over 20 Å (2 nm).
• Allosteric proteins deviate from the Michaelis–Menten curve in ligand binding or, in the cases of allosteric enzymes, in reaction velocity as a function of substrate concentration. The cooperativity is achieved by ligation-induced conformational change.
A mammalian foetus depends on its mother for oxygen. It is perhaps surprising that the oxygen affinity of isolated human foetal haemoglobin is lower than that of adult haemoglobin. However, foetal haemoglobin has a lower affinity for the effector BPG. This difference in the interaction with the effector gives foetal haemoglobin a higher oxygen affinity than the maternal haemoglobin.
The vertebrate haemoglobin molecule is a tetramer containing two identical α chains (α1 and α2) and two identical β chains (β1 and β2) (see Figure 1.16). It can adopt two structures: deoxyhaemoglobin (unligated) and oxyhaemoglobin (four oxygen molecules bound).
[Figure 10.18 plot: fraction ligated (0 to 1.0) against partial pressure of oxygen (0 to 60 mmHg), with curves for myoglobin and haemoglobin.]
Figure 10.18 Oxygen-dissociation curves for myoglobin and haemoglobin. Myoglobin shows a simple equilibrium, with a binding constant independent of oxygen concentration. Haemoglobin shows positive cooperativity, the binding constant for the first oxygen being several orders of magnitude smaller than the binding constant for the fourth oxygen. The units for partial pressure are traditional in the literature about this topic: 760 mmHg = 1 atmosphere = 101 325 Pa.
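The difference in shape between the two curves of Figure 10.18 can be reproduced qualitatively with the Hill equation, an empirical description in which an exponent of 1 gives the simple hyperbolic curve and an exponent greater than 1 gives a sigmoidal, cooperative one. The parameter values below are illustrative assumptions, not values read from the figure, and the Hill equation is a convenient summary rather than the structural mechanism discussed in this section.

```python
def fraction_ligated(p_o2, p50, hill_n=1.0):
    """Hill equation: Y = p^n / (p50^n + p^n).

    p_o2   partial pressure of oxygen (mmHg)
    p50    partial pressure at half saturation (mmHg)
    hill_n Hill coefficient; 1 corresponds to non-cooperative binding
    """
    return p_o2 ** hill_n / (p50 ** hill_n + p_o2 ** hill_n)

# ILLUSTRATIVE parameter values (assumptions).
MYOGLOBIN_P50 = 2.8       # mmHg
HAEMOGLOBIN_P50 = 26.0    # mmHg
HAEMOGLOBIN_HILL_N = 2.8

for p in range(0, 70, 10):
    mb = fraction_ligated(p, MYOGLOBIN_P50)
    hb = fraction_ligated(p, HAEMOGLOBIN_P50, HAEMOGLOBIN_HILL_N)
    print(f"{p:3d} mmHg  myoglobin {mb:.2f}  haemoglobin {hb:.2f}")
```

With these assumed values myoglobin is largely saturated at partial pressures where haemoglobin is still mostly unligated, and the haemoglobin curve rises steeply around its half-saturation pressure.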
1. At low partial pressures of oxygen, all of the subunits of haemoglobin are in the T form and are unligated (i.e. not binding oxygen). The binding constant for oxygen is low, because binding to the T state is inhibited.
2. At high partial pressures of oxygen, all of the subunits of haemoglobin are in the R form and each subunit binds an oxygen. The binding constant for oxygen is high, because binding of oxygen to the R state is unconstrained.
3. In the erythrocyte, haemoglobin is an equilibrium mixture of deoxy and oxy forms; the concentration of partially ligated forms is tiny. Binding of between two and three oxygen molecules shifts the subunits concertedly from all being T state to all being R state.
4. Effector molecules such as BPG modify oxygen affinity by shifting the T ⇌ R equilibrium, by preferentially stabilizing one of the two forms.
The interpretation of this scheme in structural terms was one of the early triumphs of protein crystallography. The structures of haemoglobin in different states of ligation have been studied with intense interest, because of their physiological and medical importance and because they were thought to offer a paradigm of the mechanism of allosteric change. The two crucial questions to ask of the haemoglobin structures are:
1. What is the mechanism by which the oxygen affinity of the deoxy form is reduced?
2. How is the equilibrium between low- and high-affinity states altered by oxygen binding and release?
Comparison of the oxy and deoxy structures has defined the changes in tertiary structures of individual subunits, and in the quaternary structure. The allosteric change involves an interplay between changes in tertiary and quaternary structure (see Figure 10.19).
• The details of the quaternary structure – the relative geometry of the subunits and the interactions at their interfaces – are determined by the way the subunits fit together.
• The fit of the subunits depends on the shapes of their surfaces.
• The tertiary structural changes alter the shapes of the surfaces of the subunits, changing the way they fit together.
The haemoglobin tetramer can be thought of as a pair of dimers: α1β1 and α2β2. The allosteric change involves a rotation of 15° of the α1β1 dimer with respect to the α2β2 around an axis approximately perpendicular to their interface. (The motion is like that of a pair of shears with α1 and α2 as the blades and β1 and β2 as the handles.)
Starting from the deoxy structure, ligation of oxygen creates strain at the haem group, arising from a change in the position of the iron and the histidine sidechain linked to it. To relieve this strain, there are shifts in the F helix and the FG corner (the region of the chain between the F and G helices). To accommodate these shifts, a set of tertiary structural changes alter the overall shape of the α1β1 and α2β2 dimers, notably the shifting of the relative positions of the FG corners. In consequence, the deoxy quaternary structure is destabilized because the dimers no longer fit together properly (having changed their shape). Adopting the alternative quaternary structure requires the tertiary structural changes to take place even in subunits not yet liganded. As a result of the quaternary structural change, these unligated subunits have been brought to a state of enhanced oxygen affinity. It is important to emphasize that this is a sequence of steps in a logical process and not a description of a temporal pathway of a conformational change.
Conformational states of serine protease inhibitors (serpins)
Figure 10.20 shows the serpin antithrombin III in two conformational states, native and latent.
Serpins show multiple conformational states with different folding patterns. Under physiological conditions, the native states of inhibitory serpins are metastable, converting spontaneously to the latent state. In the native state (Figure 10.20a), the main β-sheet (green) has five strands (the rightmost much shorter than the others). The reactive-centre loop (red) is exposed, not participating in any secondary structure. It is available to interact with a protease. In the latent state (Figure 10.20b), the reactive-centre loop forms a sixth strand within the main β-sheet. The two
Many proteins change conformation as part of the mechanism of their function 321
Figure 10.19 Some important structural differences between oxy- and deoxyhaemoglobin [1HHO, 2HHB]. (a) Changes at the haem group in human haemoglobin as a result of a change in state of ligation. This figure shows the F helix, proximal histidine, and haem group of the β chain in the oxy (black) and deoxy (red) forms; only the oxy haem is shown. The structures were superposed on the haem group. (b) The α1β1 dimer in oxy (red) and deoxy (black, in blown-up regions only) forms. In the blown-up regions, only the F helix, FG corner, and haem group are shown. The oxy and deoxy α1β1 dimers have been superposed on their interface; in this frame of reference, there is a small shift in the haem groups and a shift and conformational change in the FG corners. (c) Alternative packing of α1 and β2 subunits in oxyhaemoglobin (red) and deoxyhaemoglobin (black). The oxy and deoxy structures have been superposed on the F and G helices of the α1 monomer. Although for the purposes of this illustration we have regarded the α1 subunit as fixed and the β2 subunit as mobile, only the relative motion is significant.
prediction then becomes a task of finding the global minimum of the conformational energy function over all possible backbone and sidechain conformations. So far this approach has not generally succeeded, partly because of the imprecision of the energy function and partly because the minimization algorithms tend to get trapped in local minima.
The alternatives to a priori methods are approaches based on assembling clues to the structure of a target sequence by finding similarities to known structures. These empirical or ‘knowledge-based’ techniques have become very powerful and are currently the most successful methods known.

• Homology modelling. Suppose a target protein of known amino acid sequence but unknown structure is related to one or more proteins of known structure. Then we expect that much of the structure of the target protein will resemble that of the known protein. The related protein of known structure can therefore serve as a basis for a model of the target protein. The challenge is to predict how the differences between the sequences are reflected in differences between the structures. This can be thought of as the ‘differential’ rather than the ‘integral’ form of the folding problem.
• Attempts to predict secondary structure without attempting to assemble these regions in three dimensions. The results are lists of regions of the sequence predicted to form α-helices and regions predicted to form strands of β-sheet.
• Fold recognition. Given a library of known structures, determine which of them shares a folding pattern with a query protein of known sequence but unknown structure. If the folding pattern of the target protein does not occur in the library, such a method should recognize this. The results are a nomination of a known structure that has the same fold as the query protein, or a statement that no protein in the library has the same fold as the query protein.
• Prediction of novel folds, either by a priori or knowledge-based methods. The results are a complete coordinate set for at least the mainchain and sometimes the sidechains also. The model is intended to have the correct folding pattern, but would not be expected to be comparable in quality to an experimental structure.

D. Jones has likened the distinction between fold recognition and a priori modelling to the difference between a multiple-choice question on an examination and an essay question.

Homology modelling
Model building by homology is a useful technique when one wants to predict the structure of a target protein of known sequence, when the target protein is related to at least one other protein of known sequence and structure. If the proteins are closely related, the known protein structures – called the parents – can serve as the basis for a model of the target. It is on homology modelling that we depend to extend the results of structural genomics to the entire protein world.
The completeness and quality of the results depend crucially on how similar the sequences are. As a rule of thumb, if the sequences of two homologous proteins have 50% or more identical residues in an optimal alignment, the structures are likely to have similar conformations over more than 90% of the model. This is a conservative estimate, as Figure 10.21 shows.
Although the quality of the model will depend on the degree of similarity of the sequences, it is possible to specify this quality before experimental testing. Therefore, knowing how good a model is necessary for the intended application permits intelligent prediction of the probable success of the exercise.
Steps in homology modelling are as follows.

1. Align the amino acid sequences of the target and the protein or proteins of known structure. Usually, insertions and deletions will lie in the loop regions between helices and sheets.
2. Determine mainchain segments to represent the regions containing insertions or deletions. Stitching these regions into the mainchain of the known protein creates a model for the complete mainchain of the target protein.
3. Replace the sidechains of residues that have been mutated. For residues that have not mutated, retain the sidechain conformation. Residues that have mutated tend to keep the same sidechain conformational angles and could be modelled on this basis. However, computational methods are
324 10 Proteomics
Figure 10.21(a), residues 1–60 of the alignment:

                              10        20        30        40        50        60
                               |         |         |         |         |         |
Chicken lysozyme      KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS
Baboon α-lactalbumin  KQFTKCELSQNLY__DIDGYGRIALPELICTMFHTSGYDTQAIVEND_ESTEYGLFQISN
                      K F  CEL         D Y    L    C     S   TQA   N   ST YG  QI

Figure 10.21 (a) Aligned sequences and (b) superposed structures of two related proteins, hen egg white lysozyme (black) [1AKI] and baboon α-lactalbumin (red) [1ALC]. The sequences are related (37% identical residues in the aligned sequences) and the structures are very similar. Each protein could serve as a good model for the other, at least as far as the course of the mainchain is concerned.
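The percentage of identical residues quoted for aligned sequences can be computed directly from the two aligned rows. The following is a minimal sketch, assuming that identity is counted over columns in which neither sequence has a gap (other conventions divide by the alignment length or by the shorter sequence). The function name `percent_identity` is hypothetical, and the two strings are only the first 60 alignment columns of Figure 10.21(a), with gap positions inferred from the consensus line, so the result applies to that fragment and not to the full-length alignment.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "_"):
    """Return (identical, aligned, percent) for two aligned rows of equal length.

    Assumption: columns containing a gap in either row are ignored.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    aligned = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # skip insertion/deletion columns
        aligned += 1
        if a == b:
            identical += 1
    percent = 100.0 * identical / aligned if aligned else 0.0
    return identical, aligned, percent


# First 60 alignment columns only; "_" marks a gap (placement is an assumption).
LYSOZYME = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS"
LACTALBUMIN = "KQFTKCELSQNLY__DIDGYGRIALPELICTMFHTSGYDTQAIVEND_ESTEYGLFQISN"

identical, aligned, percent = percent_identity(LYSOZYME, LACTALBUMIN)
print(f"{identical} identical residues in {aligned} aligned columns ({percent:.1f}%)")
```

Comparing the value obtained for a full alignment with the 50% rule of thumb gives a quick first indication of whether a parent structure is likely to yield a useful model.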
now available to search over possible combinations of sidechain conformations.
4. Examine the model – both by eye and using programs – to detect any serious collisions between atoms. Relieve these collisions, as far as possible, by manual manipulations.
5. Refine the model by limited energy minimization. The role of this step is to fix up the exact geometrical relationships at places where regions of the mainchain have been joined together and to allow the sidechains to wriggle around a bit to place themselves in comfortable positions. The effect is really only cosmetic – energy refinement will not correct serious errors in such a model.

In most families of proteins, the structures contain relatively constant regions and more variable ones. The core of the structure of the family retains the folding topology, although it may be distorted, but the periphery can entirely refold (see Figure 10.11); in contrast, hen egg white lysozyme and baboon α-lactalbumin, shown in Figure 10.21, are closely related and quite similar in structure.
A single parent structure will permit reasonable modelling of the conserved portion of the target protein but will fail to produce a satisfactory model of the variable portion. A more favourable situation occurs when several related proteins of known structure can serve as parents for modelling a target protein. These reveal the regions of constant and variable structure in the family. The observed distribution of structural variability among the parents dictates an appropriate distribution of constraints to be applied to the model.
Mature software for homology modelling is available. SWISS-MODEL is a web site that will accept the amino acid sequence of a target protein, determine whether a suitable parent or parents for homology modelling exist and, if so, deliver a set of coordinates for the target. SWISS-MODEL (http://www.expasy.org/swissmodel/SWISS-MODEL.html) was developed by T. Schwede, M.C. Peitsch, and N. Guex, now at the Geneva Biomedical Research Institute. Another program in widespread use, MODELLER, was developed by A. Šali. Each of these is associated with a library of models corresponding to
Protein structure prediction and modelling 325
amino acid sequences in databanks. MODBASE (http://salilab.org/modbase) and 3DCrunch (http://swissmodel.expasy.org/SM_3DCrunch.html) collect homology models of proteins of known sequence.

• Homology modelling is one of the most useful techniques for protein structure prediction – when it is applicable.

Secondary structure prediction
The goal of secondary structure prediction is to identify those residues within the sequence that will form helices and strands of β-sheet in the native structure, independent of their spatial arrangement in the tertiary structure.
The original motivations for secondary structure prediction included:

• a belief that prediction of secondary structure should be substantially easier than prediction of tertiary structure;
• a belief that prediction of secondary structure would be a step towards prediction of tertiary structure; and
• both experimental evidence from co-polymers of amino acids and statistical evidence from the observed residue compositions of helices and β-sheets in solved protein structures implying that there are preferences among the residues for forming (or breaking) helices.

Early work on secondary structure prediction made a priori predictions based on tables of residue preferences. Methods were tested according to a three-state model in which each residue in the prediction and in the experimental structure was assigned to the classes ‘helix’, ‘sheet’, and ‘other’, and the percentage of residues assigned to the correct class was defined as a measure of success called Q3. Such methods achieved typical accuracies corresponding to Q3 ∼55%.
Progress depended on the recognition that tables of many aligned sequences contained consensus information that could improve the prediction accuracy. The idea is to apply pattern-recognition algorithms to a set of aligned sequences homologous to the sequence of an unknown structure for which the secondary structure is to be predicted. The patterns are based on the distribution of residues at the positions in the alignment table.
The most powerful pattern recognition algorithms now being applied to secondary structure prediction include neural networks and hidden Markov models. The basic idea is to develop a network of implications containing a large set of adjustable parameters governing the assignment of each residue to the three classes: helix, sheet, or other. The systems are quite general and the parameters must be adjusted by ‘training’ on known sequences and structures. The best current methods claim average accuracies of Q3 ∼75%.

• CASP categories change as the field progresses. Secondary structure prediction and fold recognition have been discontinued. Prediction of residue–residue contacts, of disordered regions, and the ability to refine models have been added.

Prediction of novel folds: ROSETTA
ROSETTA is a program developed by D. Baker and colleagues that predicts protein structure from amino acid sequence by assimilating information from known structures. At several recent CASP programmes, ROSETTA showed the most consistent success on targets in both the ‘novel fold’ and ‘fold recognition’ categories.
ROSETTA predicts a protein structure by first generating structures of fragments using known structures and then combining them. For each contiguous region of three and nine residues, instances of that sequence and related sequences are identified in proteins of known structure. For fragments this small, there is no assumption of homology to the target protein. The distribution of conformations of the fragments in the proteins of known structure models the distribution of possible conformations of the corresponding fragments of the target structure.
ROSETTA explores the possible combinations of fragment conformations, evaluating compactness, paired β-sheets, and burial of hydrophobic residues. The procedure carries out 1000 independent simulations, with starting structures chosen from the fragment conformation distribution patterns evaluated
favourably. The structures that result from these simulations are clustered and the centres of the largest clusters presented as predictions of the target structure. The idea is that a structure that emerges many times from independent simulations is likely to have favourable features.
There is a general belief that the work of Baker and colleagues represents a major breakthrough in the field of protein structure prediction.
Robetta (http://robetta.bakerlab.org) is a web server designed to integrate and implement the best of the protein structure prediction tools. The central pipeline of the software involves first the parsing of a submitted amino acid sequence of a protein of unknown structure into putative domains. Then homology modelling techniques are applied to those domains for which suitable parents of known structure exist and the de novo methods developed by Baker and co-workers are applied to other domains. In addition, the user will receive the results of other prediction methods based on software developed outside the Robetta group. These include, for example, predictions of secondary structure, coiled coils, and transmembrane helices.
Some methods are specialized to particular types of structure.

Prediction of transmembrane proteins and signal peptides
Many proteins are designed to sit within membranes. Membrane proteins mediate the exchange of matter, energy, and information between cell interiors and surroundings. Examples of membrane protein functions include energy transduction via the generation or release of concentration gradients across cell or organelle membranes, and signal reception and transmission.
It is estimated that, in the human genome, approximately 30% of genes encode membrane proteins. Approximately 70% of known targets of drugs are membrane proteins. Given that membrane proteins are so common, it is important to have reliable tools for their identification. Relatively few membrane protein structures have been determined experimentally. This places a greater burden on computational tools for sequence analysis, to identify and characterize them.
Among their adaptations, membrane proteins contain regions of mostly non-polar residues that interact with the organic layer. Many membrane proteins contain a set of seven consecutive α-helices that traverse the membrane, oriented approximately perpendicular to the plane of the membrane (see Figure 4.17). These helices are connected by loops that protrude into the aqueous surroundings. A second class of membrane protein structures contains a β-barrel. Transmembrane helices are typically 15–30 residues long. Although enriched in hydrophobic residues, they contain some polar sidechains, usually in interfaces between α-helices packed together in the structure.
A useful clue to the orientation of the helices across the membrane is the ‘+ inside rule’. The loops between helices lie either entirely inside or entirely outside the cell or organelle. Those inside contain a preponderance of positively charged residues.
A simple approach to prediction of membrane proteins involves looking for amino acid segments of 15–30 residues in length that are rich in hydrophobic residues. However, signal peptides also contain hydrophobic helices: the signal sequence typically comprises a positively charged n-region, followed by a helical hydrophobic h-region, followed by a polar c-region. Methods for recognizing transmembrane helices in amino acid sequences tend to pick up the h-regions of signal peptides as false positives. Methods for recognizing signal peptides in amino acid sequences tend to pick up transmembrane helices as false positives.
L. Käll, A. Krogh, and E.L.L. Sonnhammer trained hidden Markov models to test simultaneously for transmembrane helices and signal peptides. The goals are to find both at the same time, to discriminate between them in the results, and to predict not only the positions of the transmembrane helices but also the locations – cytoplasmic or interior – of the loops. The method, called Phobius, is available at http://phobius.cgb.ki.se/.
Phobius is the most successful algorithm currently available for recognizing signal peptides and helical transmembrane proteins, and for predicting the orientation of the transmembrane segments. Phobius is capable of distinguishing h-domains of signal peptides from transmembrane helices: the number of false classifications of signal peptides was 3.9%, and
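The ‘simple approach’ described in this section, scanning a sequence for stretches rich in hydrophobic residues, can be sketched as a sliding-window hydropathy scan. This is an illustration only and is not how Phobius works (Phobius uses hidden Markov models). The Kyte–Doolittle hydropathy scale, the 19-residue window, and the 1.6 threshold are assumptions brought in for the example, and the function names and the test sequence are hypothetical. As the text warns, a scan of this kind will also flag the h-regions of signal peptides.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not taken from the text).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    values = [KD_SCALE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_segments(seq, window=19, threshold=1.6):
    """Return (first_centre, last_centre) pairs, 0-based, for each run of
    consecutive windows whose mean hydropathy reaches the threshold."""
    half = window // 2
    profile = hydropathy_profile(seq, window)
    segments = []
    run_start = None
    for i, mean in enumerate(profile):
        if mean >= threshold:
            if run_start is None:
                run_start = i  # a hydrophobic run begins here
        elif run_start is not None:
            segments.append((run_start + half, i - 1 + half))
            run_start = None
    if run_start is not None:  # run extends to the end of the profile
        segments.append((run_start + half, len(profile) - 1 + half))
    return segments


# Hypothetical test sequence: 20 leucines flanked by charged residues.
toy = "DEKRDEKRDE" + "L" * 20 + "DEKRDEKRDE"
print(candidate_segments(toy))
```

The reported positions are window centres, so they mark the core of a hydrophobic stretch and not its exact ends; deciding where a helix starts and stops, and which side of the membrane each loop lies on, is the part that needs a trained model.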
Directed evolution and protein design
One strand of Darwin’s thinking that led to the theory of evolution was the observation that farmers could improve the quality of livestock by selective breeding. He drew an analogy between this artificial selection and the idea of natural selection that he was proposing as the mechanism of evolution. We now recognize that evolution by natural selection takes place at the molecular level. Why not artificial selection also?
Natural proteins do many things, but not everything we would like them to. For applications in technology, it would be useful to have proteins that would:

• have activities unknown in nature;
• show activity towards unnatural substrates, or altered specificity profiles;
• be more robust than natural proteins, retaining their activity at higher temperature or in organic solvents; and
• show different regulatory responses, enhanced expression, or reduced turnover.

One day, we shall be able to design amino acid sequences a priori that will fold into proteins with desired functions. As this is not yet possible, scientists have used directed evolution – or artificial selection – to generate molecules with novel properties starting from natural proteins.
Evolution requires the generation of variants and differential propagation of those with favourable features. Molecular biologists dealing with microbial evolution have advantages over the farmers that Darwin observed. We can generate large numbers of variants artificially. Screening and selection can, in many cases, be done efficiently, by stringent growth conditions, and there are virtually no limits on the size of the ‘flock’ or ‘litter’. Darwin might well have been envious. He wrote:

. . . as variations manifestly useful or pleasing to man appear only occasionally, the chance of their appearance will be much increased by a large number of individuals being kept. Hence, number is of the highest importance for success. On this principle Marshall formerly remarked,
Directed evolution and protein design 329
with respect to the sheep of parts of Yorkshire, ‘as they generally belong to poor people, and are mostly in small lots, they never can be improved.’
– The Origin of Species, Chapter 1.

The procedure of directed evolution comprises these steps:

1. Create variant genes by mutagenesis or genetic recombination.
2. Create a library of variants by transfecting the genes into individual bacterial cells.
3. Grow colonies from the cells and screen for desirable properties.
4. Isolate the genes from the selected colonies and use them as input to step 1 of the next cycle.

Strategies for generating variants include (a) single and multiple amino acid substitutions, (b) recombination, and (c) formation of chimaeric molecules by mixing and matching segments from several homologous proteins. Each method has its advantages and disadvantages. The smaller the change in sequence, the more likely that the result will be functional. Yet, multiple substitutions or recombinations give a greater chance of generating novel features. The choice depends in part on the nature of the goal. For instance, it is easier to lose a function than to gain one. (Why would you want to lose a function? Removal of product inhibition to enhance throughput in an enzymatically catalysed process is an example.)

Directed evolution of subtilisin E
Subtilisins are a family of bacterial proteolytic enzymes. Subtilisin E, from the mesophilic bacterium Bacillus subtilis, is a 275-residue monomer. It becomes inactive within minutes at 65°C. Directed evolution has produced interesting variants, with

Enhancement of thermal stability by directed evolution
Thermitase, a subtilisin homologue from Thermoactinomyces vulgaris, remains stable up to 80°C. The existence of thermitase is reassuring, because it shows that the evolution of subtilisin to a thermostable protein is possible. However, subtilisin E and thermitase differ in 157 amino acid residues. Do we have to go this far? Are all of the changes essential for thermostability, or has there been considerable neutral drift as well?
A thermostable variant of subtilisin E, produced by directed evolution, differs from the wild type by only eight residue substitutions. The variant is identical to thermitase in its temperature of optimum activity: 76°C (17°C higher than the original molecule) and stability at 83°C (a 200-fold increase relative to the wild type).
The procedure involved successive rounds of generation of variants, and screening and selection of those showing favourable properties. The formation of mutations, via error-prone PCR to produce an average of two to three base changes per gene, was alternated with in vitro recombination to find the best combinations of substitutions at individual sites (Figure 10.23). At each step, several thousand clones were screened for activity and thermostability.

Figure 10.23 Directed evolution of a thermostable subtilisin. (Plot: ln t1/2 at 65°C, vertical axis from 1 to 8, for successive steps labelled ‘Random mutagenesis’ and ‘Recombination’.) The starting wild type (WT) was subtilisin E from the mesophilic bacterium B. subtilis. Steps of random mutagenesis were alternated with recombination. At each step, screening for improved properties and artificial selection chose candidates for the next round. (t1/2 measured in minutes.)
After: Zhao, H. & Arnold, F.H. (1999). Directed evolution converts subtilisin E into a functional equivalent of thermitase. Prot. Eng. 12, 47–53.

The optimal variant differed from the wild type at eight positions: N188S, S161C, P14L, N76D, G166R, N181D, S194P, and N218S. Figure 10.24 shows their distribution in the structure. Most of the substitutions are far from the active site, which is not surprising as the wild type and variant do not
differ in function. Most of the sites of substitution are in loops between regions of secondary structure. These regions are the most variable in the natural evolution of the subtilisin family. However, only two of the substitutions produce the amino acids that appear in those positions in thermitase. Two of the substitutions are in α-helices, including P14L. P14L has a certain logic: proline tends to destabilize an α-helix because it costs a hydrogen bond.

Figure 10.24 The sites of mutation in B. subtilis subtilisin E that produced a thermostable variant by directed evolution. The sidechains shown are those of the final product.

Enzyme design
We know that mutations can change the structure and function of proteins, in some cases in radical ways. We observe this in natural protein evolution and can achieve it by artificial selection (see preceding section). We can do a reasonable job at predicting the structural changes arising from mutations. Putting these together suggests that we should be able to design proteins with altered or even novel functions in silicio.
Gramicidin S is a cyclic decapeptide antibiotic from Bacillus brevis. It contains unusual amino acids, including D-phenylalanine and ornithine. Synthesis of gramicidin S is independent of the normal ribosomal protein-synthesizing machinery. Instead, the enzyme gramicidin synthetase isomerizes the substrate L-phenylalanine to D-phenylalanine, and activates it to the amino acid adenylate. Another component of the enzyme effects the polymerization, in a sequence-specific manner.
C.-Y. Chen, I. Georgiev, A.C. Anderson, and B.R. Donald computationally redesigned gramicidin synthetase to accept other amino acids as substrates. The wild-type enzyme has no activity towards Arg, Glu, Lys, and Asp. Their computational modelling approach successfully predicted the sequences of modified enzymes with activity for each of these four unnatural substrates.
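The cycle of directed evolution described earlier in this section (generate variants, build a library, screen, and feed the selected variants into the next round) can be illustrated with a toy simulation. This is a sketch under strong assumptions and not the protocol used for subtilisin E: the ‘screen’ is a hypothetical score that counts matches to a hidden optimum sequence, only point mutagenesis is modelled (no recombination), and every name and parameter value is invented for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Step 1: replace each residue with a random one with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa
                   for aa in seq)


def screen(seq, optimum):
    """Hypothetical screen: number of positions that match a hidden optimum."""
    return sum(a == b for a, b in zip(seq, optimum))


def directed_evolution(start, optimum, rounds=10, library_size=200,
                       keep=5, rate=0.05, seed=0):
    rng = random.Random(seed)  # seeded so that a run is repeatable
    parents = [start]
    best_per_round = [screen(start, optimum)]
    for _ in range(rounds):
        # Steps 1-2: build a library of variants from the current parents.
        library = [mutate(rng.choice(parents), rate, rng)
                   for _ in range(library_size)]
        # Step 3: screen. Parents stay in the pool, so the best score never drops.
        pool = sorted(set(library + parents),
                      key=lambda s: (-screen(s, optimum), s))
        # Step 4: the selected variants are the input to the next cycle.
        parents = pool[:keep]
        best_per_round.append(screen(parents[0], optimum))
    return parents[0], best_per_round


start = "A" * 30
optimum = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"  # hypothetical target, 30 residues
best, history = directed_evolution(start, optimum)
print(history)
```

Even this crude model shows the point of Darwin’s remark quoted above: increasing `library_size` raises the chance that a rare favourable variant appears in any one round.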
Within cells, life is organized and regulated by a set of protein–protein and protein–nucleic acid interactions.
Interacting proteins and nucleic acids span a range of structures and activities:

• simple dimers or oligomers in which the monomers appear to function independently;
• oligomers with functional ‘cross-talk’, including ligand-induced dimerization of receptors and allosteric proteins such as haemoglobin (Figure 1.16), phosphofructokinase, and aspartate carbamoyltransferase;
• large fibrous proteins such as actin or keratin;
• non-fibrous structural aggregates such as viral capsids;
• large aggregates with dynamic properties such as F1-ATPase, pyruvate dehydrogenase, the GroEL–GroES chaperonin, and the proteasome;
• protein–nucleic acid complexes, including ribosomes, nucleosomes, transcription regulation complexes, splicing and repair particles, and viruses;
• many proteins, whether monomeric or oligomeric, which function by interacting with other proteins. These include all enzymes with protein substrates and many antibodies, inhibitors, and regulatory proteins.
Protein complexes and aggregates 331
individual person or animal, and (c) infect other individuals, by various routes including ingestion of nervous tissue from an affected animal (or person, in the case of kuru).

• PrPC = prion protein-Cellular; PrPSc = prion protein-Scrapie.

Prion disease presents widespread health problems for humans and animals. In 2001 a serious epidemic of bovine spongiform encephalopathy (colloquially, ‘mad-cow disease’) devastated the United Kingdom countryside. There was an apparent association with the appearance of human cases of variant Creutzfeldt–Jakob disease (vCJD). In the hereditary disease familial CJD, symptoms began to appear in people aged 55–75. Variant CJD affected people in their twenties. It is hypothesized that these outbreaks were associated with transmission of prion protein infections across species barriers: sheep to cows for BSE, and cows to humans for vCJD.
Prion proteins form a family of homologous proteins in many species of animals, and also in yeast, but apparently not in C. elegans or Drosophila. Normal human prion protein is synthesized as a 253-residue polypeptide. This comprises: an N-terminal signal peptide, followed by a domain containing ∼5 tandem repeats of the octapeptide PHGGGWGQ (in mammals), a conserved 140-residue domain, and a C-terminal hydrophobic domain. The signal domain and the C-terminal domain are cleaved off, and the protein is anchored to the extracellular side of the cell membrane of neurons by a GPI (glycosylphosphatidylinositol) group bound to the C-terminal residue Ser231 of the mature protein.
The normal role of PrPC is not clear. Mice in which PrPC has been knocked out develop normally for a time, and eventually die of apparently unrelated developmental defects. In fact, PrPC-knockout mice are not susceptible to infection with PrPSc, an observation important in proving the mechanism of the disease.
The nature of the conformational change is still not entirely clear. The change from α to β structure shown by circular dichroism is one clue. In principle, prion proteins show multiple structures from one polypeptide sequence. However, differences in glycosylation patterns between PrPC and PrPSc have been reported; these may play a role in defining the conformation.
The mechanism by which PrPSc catalyses the transformation of additional PrPC to PrPSc is also not clear. Inherited prion diseases are associated with mutants, presumably increasing the tendency for conformational mobility (Table 10.2). A related question concerns the kinetics of the process – what governs the rate of accumulation of aggregates that causes many prion diseases to appear only among the elderly?

• Many diseases arise from formation of protein aggregates, including: sickle-cell anaemia, amyloidoses, Alzheimer’s disease, Huntington’s disease, familial and variant Creutzfeldt–Jakob disease, prion diseases. Most of these are genetic, some are infectious.

Properties of protein–protein complexes

Stoichiometry – what is the composition of the complex?
Protein complexes vary widely in the numbers and variety of molecules they contain. Some contain only
Box 10.8
The archaeal proteasome contains 14 identical α subunits and 14 identical β subunits. They are arranged in four stacked rings, α7–β7–β7–α7. All β subunits have protease activity. The core of the eukaryotic proteasome also contains the α7–β7–β7–α7 stacked ring structure, but each ring contains seven diverged and non-identical subunits. Thus, the eukaryotic proteasome contains seven homologous but non-identical α subunits and seven homologous but non-identical β subunits. Only three of the eukaryotic β subunits have protease activity. The eukaryotic proteasome also contains large regulatory subunits in addition to the α–β rings, which select ubiquitinylated proteins for degradation.

Dissociation constants of protein–ligand complexes span a wide range, as shown in Table 10.3.

Table 10.3
Interaction                   Ligand or partner                      KD (mol l⁻¹)
Allosteric activator          Monovalent ion                         10⁻⁴–10⁻²
Co-enzyme binding             NAD, for instance                      10⁻⁷–10⁻⁴
Antigen–antibody complexes    Various                                10⁻⁴–10⁻¹⁶
Thrombin inhibitor            Hirudin                                5 × 10⁻¹⁴
Trypsin inhibitor             Bovine pancreatic trypsin inhibitor    10⁻¹⁴
Streptavidin                  Biotin                                 10⁻¹⁵
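The dissociation constants in Table 10.3 can be made more tangible with two small calculations. The sketch below assumes the usual definition KD = [P][L]/[PL], from which the fraction of protein carrying ligand is [L]/(KD + [L]), and it converts KD to a standard free energy of binding with ΔG° = RT ln KD, taking a 1 mol l⁻¹ standard state and 298.15 K. The function names and the choice of temperature are assumptions made for the example.

```python
import math

R = 8.314462618  # gas constant, J mol^-1 K^-1


def fraction_bound(free_ligand, kd):
    """Fraction of protein with ligand bound, [PL] / ([P] + [PL]).

    Both arguments are in mol/l. Follows from KD = [P][L]/[PL].
    """
    return free_ligand / (kd + free_ligand)


def binding_free_energy(kd, temperature=298.15):
    """Standard free energy of binding in kJ/mol: RT ln(KD).

    More negative values mean tighter binding (1 mol/l standard state assumed).
    """
    return R * temperature * math.log(kd) / 1000.0


# Half of the protein is bound when the free ligand concentration equals KD.
print(fraction_bound(1e-7, 1e-7))

# Values of KD taken from Table 10.3.
for name, kd in [("NAD (weakest quoted)", 1e-4),
                 ("hirudin-thrombin", 5e-14),
                 ("biotin-streptavidin", 1e-15)]:
    print(f"{name}: {binding_free_energy(kd):.1f} kJ/mol")
```

Because the free energy depends on the logarithm of KD, the eleven orders of magnitude separating the weakest and tightest entries in the table correspond to only a few-fold difference in binding free energy.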
a few proteins; others are very large. For example, pyruvate dehydrogenase contains hundreds of subunits and some viral capsids contain thousands.
Some prokaryotic proteins containing identical subunits are homologous to eukaryotic proteins containing related but non-identical subunits, arising by gene duplication and divergence. The proteasome is an example (see Box 10.8). Some viruses achieve diversity without duplication, by combining proteins with the same sequence but different conformations.

Affinity – how stable is the complex?
The measure of the affinity of a complex is the dissociation constant, KD, the equilibrium constant for the reverse of the binding reaction:

$$\text{protein–ligand} \rightleftharpoons \text{protein} + \text{ligand}, \qquad K_D = \frac{[\mathrm{P}][\mathrm{L}]}{[\mathrm{PL}]}$$

where [P], [L], and [PL] denote the numerical values of the concentrations of protein (P), ligand (L), and protein–ligand complex (PL), respectively, expressed in mol l⁻¹. The lower the KD, the tighter the binding. KD corresponds to the concentration of free ligand at which half the proteins bind ligand and half are free: [P] = [PL].

Structural studies have elucidated several important features of the interactions between soluble proteins that contribute to affinity.

• What holds the proteins together? Burial of hydrophobic sidechains, hydrogen bonds and salt bridges, and van der Waals forces. A typical protein–protein interface might involve 22 residues and 90 atoms, of which 20% would be mainchain atoms, and an occasional water molecule (Box 10.9). Burial of 1 Å² contributes ∼100 J mol⁻¹ to stability. There is, on average, one intermolecular hydrogen bond per 170 Å² of interface area. The average value of the surface area buried in binary protein complexes is ∼1600 Å². The minimum buried surface for stability of a protein–protein complex is ∼1000 Å².
• Do proteins change conformation in complexing? In some cases the interaction energy has to ‘pay for’ the conformational change and the interface tends to be correspondingly larger. Complexes that involve conformational changes generally bury >2000 Å².
• What determines specificity? Complementarity of the occluding surfaces, in shape, hydrogen-bonding potential, and charge distribution.

Prediction of protein complexes from the structures of the partners is the docking problem.
Reliable solution of this problem, together with
progress in structural genomics, would permit
• The Michaelis constant of an enzyme is the dissociation
in silicio screening of proteomes for interacting
constant of the enzyme–substrate complex.
partners.
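The definition of KD and the interface rules of thumb quoted in this section lend themselves to a short numerical sketch. The following Python fragment is only an illustration: the function names and the concentrations are assumptions chosen for the example, not data from the text, and the stabilization estimate treats all buried area as contributing ∼100 J mol⁻¹ per Å², which is an upper bound if part of the interface is not hydrophobic.

```python
def dissociation_constant(p_free, l_free, pl):
    """K_D = [P][L] / [PL]; all concentrations in mol/l."""
    return p_free * l_free / pl


def fraction_bound(l_free, k_d):
    """Fraction of protein carrying ligand: [PL] / ([P] + [PL]) = [L] / (K_D + [L])."""
    return l_free / (k_d + l_free)


def interface_estimates(buried_area):
    """Rule-of-thumb estimates for an interface burying `buried_area` square angstroms.

    Uses the figures quoted in the text: about one hydrogen bond across the
    interface per 170 square angstroms, and about 100 J/mol of stabilization
    per square angstrom buried (assumed here to apply to the whole area).
    """
    hydrogen_bonds = buried_area / 170.0
    stabilization_kj_per_mol = buried_area * 100.0 / 1000.0
    return hydrogen_bonds, stabilization_kj_per_mol


# Illustrative concentrations (assumed, not measured), giving K_D of about 1 nM.
k_d = dissociation_constant(p_free=1e-6, l_free=1e-9, pl=1e-6)
print(k_d)                        # K_D in mol/l
print(fraction_bound(1e-9, k_d))  # free ligand equal to K_D
print(fraction_bound(9e-9, k_d))  # free ligand at nine times K_D
print(interface_estimates(1600))  # average buried area quoted for binary complexes
```

When the free ligand concentration equals KD the bound fraction is one half, which is the statement [P] = [PL] in numerical form.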
How are complexes organized in three dimensions?

When two proteins form a complex, each leaves a ‘footprint’ on the surface of the other, defining
the portion of the surface involved in the interaction. If two proteins interact using the same
surface on both, the complex is closed. If two proteins interact through different surfaces, the
complex is open. The significance is that a closed complex does not allow additional proteins to
bind with the same interaction.
BOX 10.9 A protein–protein interface: phage M13 gene III protein and E. coli TolA
During infection of E. coli by phage M13, a complex forms between the N-terminal domain of the
minor coat gene 3 protein of the phage and the C-terminal domain of a receptor protein in the
bacterial cell membrane, TolA (see Figure 10.25).
The complex is stabilized by burial of 1765 Å² of surface area, by combination of β-sheets from
both proteins to form an extended β-sheet (see Figure 10.25a) and by several linkages of
sidechains by hydrogen bonds and salt bridges. The area buried in the complex is divided almost
evenly between the two partners.
An open complex, in which the surface of potential interaction is not occluded, can grow by
accretion of additional subunits. Thus, open but not closed complexes are compatible with the
formation of aggregates by continued addition of monomers making the same interaction.

Multisubunit proteins

An important class of protein–protein complexes is oligomeric or multisubunit proteins. We appeal
to structural biology to address the following questions.
• What is the stoichiometry? How many different types of subunit appear and how many of each are
present? Most proteins are homodimers or homotetramers. Monomers and heterooligomers are less
common. The ribosome is an extreme example of a heterooligomer. Proteins containing odd numbers of
subunits are rarer than those containing even numbers of subunits.
• What is the relationship between the contributions of different subunits to the interface?
Consider a dimer of two identical subunits: in isologous binding, the interface is formed from the
same sets of residues from both monomers; in heterologous binding, different monomers contribute
different sets of residues to the binding site. A handshake is isologous.
• Is the structure open or closed? In an open structure, at least one of the sites forming the
binding surface is exposed in at least one of the subunits, so that additional subunits could be
added on. In a closed structure, all binding surfaces are in contact with partners and the
assembly is saturated. Domain swapping – exchange of segments between two interacting domains –
often but not always produces closed isologous dimers.
• What is the symmetry of the structure? Symmetry is the rule, rather than the exception, in
structures of oligomeric proteins. The subunits in most dimers are related by an axis of twofold
symmetry. Yeast hexokinase is an exception. It forms an asymmetric dimer. In the human growth
hormone receptor, a nearly symmetric dimer binds an asymmetric ligand (see Figure 10.26).
• Do any of the subunits undergo conformational changes on assembly? Often we don’t know. In cases
of extensively interlocked interfaces, such as the Trp repressor, the monomers could not adopt the
same structure in the absence of their partners. Allosteric proteins can undergo ligand-dependent
conformational changes. In ATP synthase, a threefold symmetric complex of αβ subunits is distorted
by interaction with the γ subunit.
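The distinction between isologous and heterologous binding in a dimer of identical subunits can be phrased as a comparison of residue sets. The sketch below is a deliberately simplified illustration: the function name and the residue numbers are hypothetical, real interfaces rarely match residue for residue, and the open or closed label simply applies the rule stated in the text (same surface on both partners gives a closed complex, different surfaces an open one).

```python
def classify_homodimer(residues_from_a, residues_from_b):
    """Classify the interface of a dimer of two identical subunits.

    Each argument is the collection of residue numbers that one monomer
    contributes to the interface.
    """
    if set(residues_from_a) == set(residues_from_b):
        # Same surface used on both monomers: no free copy remains to extend the complex.
        return "isologous", "closed"
    # Different surfaces used: each monomer still exposes the surface its partner used.
    return "heterologous", "open"


# Hypothetical residue numbers, for illustration only.
print(classify_homodimer([10, 11, 45], [45, 10, 11]))
print(classify_homodimer([10, 11, 45], [80, 81, 120]))
```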
Figure 10.26 Human growth hormone (blue) in complex with two molecules illustrating the dimerized exterior domain of its receptor
(green, orange) [3hhr].
● RECOMMENDED READING
Exercises
Exercise 10.1 For each of the following amino acids, say whether they are hydrophobic, polar,
positively charged at pH 7, or negatively charged at pH 7: (a) leucine; (b) aspartic acid;
(c) glutamine; (d) phenylalanine; (e) lysine.
Exercise 10.2 (a) Identify a hydrophobic amino acid that is more bulky than alanine but less bulky
than leucine. (b) Identify two amino acids that have almost the same size and shape (differing only
in that one has a methyl group and the other has a hydroxyl group).
Exercise 10.3 On a photocopy of Figure 10.2, indicate the bond, a rotation around which would
correspond to the conformational angle ψi−1.
Exercise 10.4 Would it be possible, by rotation around bonds shown in Figure 10.2, to convert
residue i from the L to the D conformation?
Exercise 10.5 Estimate the values of φ and ψ that correspond to the αL conformation in Figure 10.3.
Exercise 10.6 Figure 10.2 shows the trans conformation of the polypeptide chain. (a) If the angle
labelled ωi in that figure is changed from ω = 180° trans to ω = 0° cis, keeping the positions of
all atoms in residues i − 1 and i fixed, what atom would occupy the position currently occupied
by Hi+1? (b) In the structure shown in Figure 10.2, φi = 180°. The only unlabelled atom is the
hydrogen connected to Cαi. Assuming a rotation that keeps the positions of Ni, Hi, and the atoms
of residue i − 1 fixed, estimate the value of φi that would place this unlabelled hydrogen atom at
the position that Cβi occupies in Figure 10.2.
Exercise 10.7 Describe and compare the nature of the accelerating and retarding forces on the
molecules in mass spectrometry and SDS-PAGE.
Exercise 10.8 In a typical protein–protein interface of area 1700 Å2: (a) how many intermolecular
hydrogen bonds would you expect to be formed? (b) How many fixed water molecules would you
expect to find in the interface? (c) If the entire buried area were hydrophobic, what contribution to
the free energy of stabilization would you estimate it to make?
Exercise 10.9 In the dimer between syntrophin and neuronal nitric oxide synthase (see Figure 10.27),
(a) is the dimer structure open or closed? (b) What secondary structure element is shared between
the two domains?
Figure 10.27 Interaction between PDZ domains in syntrophin (cyan) and neuronal nitric oxide synthase
(magenta) [1QAV].
Problems
Problem 10.1 P. Schultz has posed the question: would an extended genetic code – perhaps
one in which one of the rarely used stop codons coded for a novel amino acid with a somewhat
unusual size, shape, or charge distribution – be ‘better’ than the normal one? The question could
be taken to apply to either natural or artificial extensions of the code. How would you design
experiments to answer this question? What precautions would you consider necessary?
Problem 10.2 As a general rule, vertebrates use creatine as a phosphogen and invertebrates use
arginine. Figure 10.28 shows the sequence alignment of creatine kinases (CK) from rabbit and
chicken, and arginine kinases (AK) from sea cucumber, horseshoe crab, and abalone. The numbers
of identical residues in pairs of sequences in this alignment are:
10 20 30 40 50 60
| | | | | |
Rabbit CK (KCRM_RABIT) GNTHNKYKLNYKSEEEYPDLSKHNNHMAKVLTPDLYKKLRDKETPSGFTLDDVIQTGVDN
Chicken CK (KCRS_CHICK) ATVHEKRKL_FPPSADYPDLRKHNNCMAECLTPAIYAKLRDKLTPNGYSLDQCIQTGVDN
Sea cucumber AK (KARG_STIJA) MANLNQKKYPAKDDFPNFEGHKSLLSKYLTADMYAKLRDVATPSGYTLDRAIQNGVDN
Horseshoe crab AK (KARG_LIMPO) MVDQATLDKLEAGFKKLQEASDCKSLLKKHLTKDVFDSIKNKKTGMGATLLDVIQSGVEN
Abalone AK (KARG_NORMA) MLAMASVEELWA___KLDGAADCKSLLKNNLTKERYEALKDKKTKFGGTLADCIRSGCLN
LT y l dk T G tL Iq Gv N
370 380
| |
Rabbit CK (KCRM_RABIT) LMVEMEKKLEKGQSIDDMIPAQK____
Chicken CK (KCRS_CHICK) YLVDCEKKLEKGQDIKVPPPLPQFGRK
Sea cucumber AK (KARG_STIJA) VLIEMEKKLEKGESIDDLVPK______
Horseshoe crab AK (KARG_LIMPO) EMIKMEKAAA_________________
Abalone AK (KARG_NORMA) ACLAKEKELAAAKK_____________
EK l
Figure 10.28 Alignment of creatine kinase (CK) from rabbit and chicken, and arginine kinase (AK) from sea cucumber, horseshoe crab,
and abalone.
(a) Does sea cucumber arginine kinase appear to be more related to vertebrate creatine kinases
or other invertebrate arginine kinases? (b) On a photocopy of Figure 10.28, circle (at least two)
regions, each at least four residues long, in which sea cucumber arginine kinase resembles
vertebrate creatine kinases more closely than it resembles other invertebrate arginine kinases, and
circle (at least two) regions, each at least four residues long, in which sea cucumber arginine kinase
resembles other invertebrate arginine kinases more closely than it resembles vertebrate creatine
kinases. (c) Can you identify any residues that might conceivably be responsible for the difference
in substrate specificity between arginine and creatine? (d) Outline how you could test the
hypothesis you presented as your answer to part (c), using only computational and not wet
laboratory methods. (e) Which is more likely: that arginine kinase activity evolved once and that
no protein in any ancestor of sea cucumber had creatine kinase activity, or that sea cucumber
arginine kinase evolved from a precursor it shared with present vertebrate creatine kinases?
Explain your reasoning.
Weblems
Weblem 10.1 Choose one of the following glycoprotein storage diseases: (1) aspartylglucosaminuria,
(2) α-mannosidosis, (3) β-mannosidosis, (4) Sandhoff–Jatzkewitz disease, or (5) sialidosis. (a) What
is the inheritance pattern of this disease? (b) What is the incidence of this disease in the USA; i.e.
what fraction of the population is affected? (c) What is the biochemical defect that causes this
disease?
Weblem 10.2 Some athletes take creatine in order to build up their ability to store energy as
phosphocreatine. Athletes find that creatine certainly improves performance in ‘burst’ events –
all-out effort for up to 10 seconds – but its efficacy for ‘endurance events’ is debated. Examples
of burst events include the 100 metre dash, a point in tennis, or a play in football. In all of these
cases, there is a pause between energy bursts to allow replenishment of phosphocreatine by
oxidative metabolism. This takes ∼30–60 seconds. (The International Tennis Federation rules allow
no more than 20 seconds between points.) (a) How does the US Food and Drug Administration
classify creatine? (b) Is creatine banned by the Olympics or major commercial sports enterprises?
(c) Why would it be difficult to enforce a ban on creatine supplements?
Weblem 10.3 The bacterium Pseudomonas fluorescens and the fungus Curvularia inaequalis each
possesses a chloroperoxidase, an enzyme that catalyses halogenation reactions. Do these enzymes
have the same folding pattern?
CHAPTER 11
Systems Biology
LEARNING GOALS
• To gain a sense of the discipline of systems biology as an integrative approach to all the ‘omics’
disciplines.
• To understand the idea of networks and their representation as graphs.
• To appreciate that many aspects of metabolic networks are shared by different organisms, and
that they evolve.
• To know the databases dealing with metabolic pathways, including EcoCyc, KEGG, WIT, and
BRENDA, and be fairly fluent at using them and the links they contain.
• To distinguish between static and dynamic aspects of biological networks.
• To understand the ideas of stability and robustness and the mechanisms by which life achieves
them.
• To appreciate how computational concepts important for systems biology, such as randomness
and complexity, have been made precise and quantitative.
• To understand the structure, dynamics, and evolution of metabolic networks.
• To know the different ways of experimentally determining protein–protein and protein–nucleic
acid interactions.
• To be familiar with some of the basic types of DNA-binding protein.
• To understand the structures, dynamics, and evolution of regulatory networks.
• To appreciate the adaptability of the yeast regulatory network.
The goal of systems biology is the synthesis of all biological data into a unified picture of the
structure, dynamics, logistics, and ultimately the logic of living things. Systems biology focuses
on the integration of gene, RNA, and protein activity.
Molecules are social animals and life depends on their interactions. As individual molecules have
specialized functions, control mechanisms are required to organize and coordinate their
activities. Failure of control mechanisms can lead to disease and even death.

Two parallel networks: physical and logical

Systems biology deals with networks. Networks consist of sets of molecules and the interactions
among them. There are networks of genes, of RNAs, of proteins, and of metabolites. The same set of
molecules may be connected by different types of interaction or relationship, to form different
networks (see Box 11.1).

BOX 11.1 Some networks in systems biology

Network      Element of network      Connection between elements
Genomes      Gene                    Homology or shared expression pattern or linkage
Protein      Protein                 Homology or regulatory relationship or shared
                                     expression pattern or physical complex formation
Metabolite   Chemical compound,      Substrate and product of an enzymatic reaction or
             e.g. glucose            similarity in structure or similarity in reactivity

In cells, two interaction networks are in operation: (a) a physical network of protein–protein and
protein–nucleic acid complexes, and (b) a logical network of control cascades. Interactions may be
physical or logical. Often, they are both. Physical and logical networks operate in parallel. A
macromolecular complex such as the ribosome is a network of proteins and RNAs, interacting through
the physical contacts in their assembly. A transcription regulatory network is a network of genes,
exerting logical control over expression patterns via the synthesis of specific DNA-binding
proteins. A transcription factor that acts by binding to DNA may never interact physically with
the proteins the expression of which it controls. Metabolic pathways have a similar duality: many
but not all metabolic pathways are mediated by physical protein–protein interactions and regulated
by logical ones.
Examples of purely physical interactions include macromolecular complexes – both multiprotein
complexes and protein–nucleic acid complexes. Examples of logical interactions not mediated
entirely by direct physical interaction between proteins include feedback loops in which the
increase in concentration of a product of a metabolic pathway inhibits an enzyme catalysing one of
the early steps in the pathway, or the secretion of a small molecule as a signal to other cells,
in ‘fire and forget’ mode (see Box 11.2). In these cases, the logical interaction is transmitted
by diffusion of a small molecule, rather than physical contact between source and recipient of the
signal.
The allosteric change in haemoglobin is an example of simultaneous physical and logical
interaction: the subunits of haemoglobin respond to changes in oxygen levels by a conformational
change that alters oxygen affinity. Another example is the transmission of a signal from the
surface of a cell across the membrane to the interior by dimerization of a receptor. This can be
the initial trigger of a process that ultimately affects gene expression. Not all links of this
process need involve protein–protein interactions; some may be mediated by diffusion of small
molecules such as cyclic AMP.
Even though particular complexes may participate in both physical and logical networks, the two
networks remain distinct in terms of their organization and their biological function, and it is
useful to keep the distinction between them in mind, especially when they overlap.

• Cells contain both physical and logical networks. Many interactions are common to both.
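One way to keep the two networks distinct while still seeing where they overlap is to store them as separate edge sets over the same molecules. The fragment below is a minimal sketch under stated assumptions: the molecule names and the interactions are invented for illustration, physical contacts are treated as undirected pairs, and logical links as directed (controller, target) pairs.

```python
# Hypothetical molecules and interactions, for illustration only.
physical_contacts = {
    frozenset({"kinase", "substrate"}),
    frozenset({"TF", "kinase"}),
}
logical_controls = {
    ("TF", "geneX_product"),  # control exerted via DNA binding, no protein-protein contact
    ("kinase", "substrate"),  # control exerted through direct contact
}


def split_logical_links(physical, logical):
    """Separate logical links that coincide with a physical contact from those that do not."""
    both, logical_only = set(), set()
    for controller, target in logical:
        if frozenset({controller, target}) in physical:
            both.add((controller, target))
        else:
            logical_only.add((controller, target))
    return both, logical_only


both, logical_only = split_logical_links(physical_contacts, logical_controls)
print(both)
print(logical_only)
```

Keeping two edge sets rather than one merged graph preserves the distinction the text recommends, while the split makes the overlap explicit.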
• a cycle is a path of length >2 for which the initial and final end points are the same, but in
which no intermediate link is repeated.

Sequences of consecutive metabolic reactions are pathways in a graph of metabolites. An
irreversible reaction corresponds to a directed edge. A concatenation of signal-transduction
events is a pathway in a regulatory network.

• A vitamin is a compound that we must eat because we cannot synthesize it. Therefore, there can
be no path in the metabolic network leading to a vitamin.

For some networks, such as metabolic pathways or patterns of traffic in cities, the dynamics of
the system depend on the transmission capacities of the individual links. These capacities can be
indicated as labels of the edges of the graph. This allows modelling of patterns of flow through
the network. Examples include route planning, in travel or deliveries. Note that the shortest path
may well not give optimal throughput. In many cities, taxi drivers are exquisitely sensitive – and
insensitively voluble – about currently optimal traffic paths.
A graph that contains a path between any two vertices is said to be connected. Alternatively, a
graph may split into several connected components. The graph on page 344 has two connected
components, one containing five vertices and one containing only one vertex. (In the extreme case,
a graph could contain many vertices but no edges at all.) It is often useful to determine the
shortest path between any two nodes, and to characterize a network by the distribution of shortest
path lengths. The phrase ‘six degrees of separation’ – the title of a play by John Guare, made
into a film – refers to the assertion (attributed originally to Marconi) that if the people in the
world are vertices of a graph and the graph contains an edge whenever two people know each other,
then the graph is connected and there is a path between any two vertices with length ≤6.
The London Underground network is connected in that there is (usually) a route between any two
stations. Many questions familiar to commuters are shared in the analysis of biological networks;
for example: what are the paths connecting station A and station B? Regarding different lines as
subnetworks, how easy is it to transfer from one to another, i.e. what is the nature of the
patterns of connectivity? In case of failure of one or more links, is the network robust, i.e.
does it remain connected?

Trees

A tree is a special form of graph (see p. 182). A tree is a connected graph containing only one
path between each pair of vertices. A hierarchy is a tree: examples include military chains of
command and the Linnaean taxonomy. A tree cannot contain a cycle: if it did, there would be two
paths from the initial point (= the final point) to each intermediate point. In the undirected
graph on page 344, the subgraph consisting of vertices V1, V2, V4, V5, and V6 is a tree. Adding an
edge from V1 to V5 would create an alternative path from V1 to V5, and the cycle
V1→V2→V4→V5→V1; the graph would no longer be a tree.
The density of connections is the mean number of edges per vertex and characterizes the structure
of a graph. A fully connected graph contains an edge between every pair of nodes. A fully
connected graph of N vertices has N − 1 connections per vertex. A graph with no edges has 0
connections per node. Nervous systems of higher animals achieve their power not only by containing
large numbers of neurons but also by high degrees of connectivity.
Sometimes there are limits on numbers of connections. For many human societies, in the graph in
which individuals are the vertices and edges link people married to each other, each node has
connectivity 0 or 1. Hydrocarbon structures can be represented as graphs, with the hydrogen and
carbon atoms as vertices and the chemical bonds as the edges. The rules of valence require that
each node corresponding to a carbon atom has ≤4 connections.
In other networks, connectivities follow statistical regularities. For instance, the World Wide
Web can be considered to be a directed graph: individual documents are the nodes and hyperlinks
are the edges. The distribution of incoming and outgoing links follows power laws: P(k) =
probability of k edges ∝ k^(−q), where q = 2.1 for incoming links and q = 2.45 for outgoing links
(see Box 11.5).
The density of connections is very important in defining the properties of a network.
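The graph terms used in this section (connected components, shortest paths, trees) can be made concrete with a short sketch. The graph figure itself is not reproduced here, so the edge list below is an assumption chosen only to be consistent with the description in the text (a five-vertex tree on V1, V2, V4, V5, V6 plus one isolated vertex); in particular, where V6 attaches is a guess. The function names are likewise illustrative.

```python
from collections import deque


def build_adjacency(vertices, edges):
    """Adjacency sets for an undirected graph."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """List of vertex sets, one per connected component (breadth-first search)."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adjacency[v] - component:
                component.add(w)
                queue.append(w)
        seen |= component
        components.append(component)
    return components


def shortest_path_length(adjacency, source, target):
    """Number of edges on a shortest path, or None if the vertices are not connected."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return distance[v]
        for w in adjacency[v]:
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    return None


def is_tree(adjacency):
    """A tree is connected and has exactly one fewer edge than it has vertices."""
    edge_count = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    return len(connected_components(adjacency)) == 1 and edge_count == len(adjacency) - 1


# Assumed edge list (the figure is not reproduced here); V3 is left isolated.
vertices = ["V1", "V2", "V3", "V4", "V5", "V6"]
edges = [("V1", "V2"), ("V2", "V4"), ("V4", "V5"), ("V5", "V6")]
graph = build_adjacency(vertices, edges)
component_sizes = sorted(len(c) for c in connected_components(graph))
print(component_sizes)

tree_vertices = ["V1", "V2", "V4", "V5", "V6"]
tree_part = build_adjacency(tree_vertices, edges)
print(is_tree(tree_part), shortest_path_length(tree_part, "V1", "V5"))

# Adding the edge V1-V5 closes the cycle V1-V2-V4-V5-V1.
with_cycle = build_adjacency(tree_vertices, edges + [("V1", "V5")])
print(is_tree(with_cycle), shortest_path_length(with_cycle, "V1", "V5"))
```

Breadth-first search is used for both tasks because all edges count equally here; a network with labelled edge capacities, as in the traffic example, would need a weighted method instead.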
sequence of characters:
three-dimensional structures, and – especially – networks. Indeed, most types of biological
information can be regarded as networks. For instance, a nucleotide sequence is equivalent to a
network in which the individual bases are the nodes, and each base is connected by a directed edge
pointing to the next base. That’s a perfectly proper graph! Conversely, recognizing that sequences
are networks can usefully lead us to ask – can we define analogues of sequence alignment for more
general networks? (Yes, we can.)
In biology, we are also interested in the complexity of processes.

Another way to look at this is directly relevant to systems biology: the dynamics of non-chaotic
systems are robust to small changes in initial conditions, but the dynamics of chaotic systems are
not robust to small changes in initial conditions.

Chaos and predictability
The discovery of the laws of mechanics in the 17th century – Newton’s Principia was published in
1687 – gave rise to the hope that the dynamics of the solar system in particular (and much, if not
all, of the universe in general) was predictable. Laplace expressed the view that:
A problem that can be solved in polynomial time is said to be in class P. O(N log N) algorithms
are faster than O(N²) and are, therefore, in class P.
Suppose, however, that the optimal algorithm to solve a problem has order worse than polynomial –
for instance, it might have exponential order O(2^N) – but that if you propose a solution, it can
be checked in polynomial time. Such a problem is said to be of class NP. (NP does not stand for
non-polynomial, but for non-deterministic polynomial, referring to a different model for the
computation. Don’t worry about this technical distinction.)
Consider the problem of sorting a list of numbers into order. That is, given a series of N
numbers: 2,1,7,5,8,4,3, . . . an algorithm must produce as output the numbers rearranged into
order: 1,2,3,4,5,7,8, . . . Whatever the order of the optimal algorithm that solves the problem,
an algorithm to verify that 1,2,3,4,5,7,8, . . . is a solution (or that 1,8,7,2,4,5,3, . . . , is
not a solution) can run in time linear in the length of the list. It is necessary only to check
that each number is greater than or equal to its predecessor, which can be done by looking at each
element of the list once. Therefore, sorting a list of numbers into order is a problem in class
NP. (Sorting also happens to be in class P; sorting algorithms are known with order O(N log N).)

NP-complete problems. Does P = NP?
Many NP problems have equivalent complexities, in the sense that if a polynomial algorithm were
discovered for one, it could be applied to solve others. The set of NP-complete problems is the
set of NP problems such that if we could solve any one of them in polynomial time, we would be
able to solve all of them in polynomial time. In other words, the discovery of a polynomial-time
algorithm for any problem known to be NP-complete would cause the classes P and NP-complete to
coalesce. But are there any NP problems that are not in class P? This is the famous unsolved
conjecture of computer science: does P = NP?
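The contrast between solving a problem and checking a proposed solution can be illustrated with the sorting example from the box. The sketch below is an illustration only: the function names are made up for the example, and the check goes slightly beyond the text by also confirming that the proposed answer contains the same numbers as the input, which a complete verifier would need to do and which still takes a single pass over each list on average.

```python
from collections import Counter


def is_sorted(values):
    """One pass: every element must be greater than or equal to its predecessor."""
    return all(earlier <= later for earlier, later in zip(values, values[1:]))


def is_valid_sort(original, candidate):
    """Check a proposed solution: same elements as the input, in non-decreasing order."""
    return Counter(original) == Counter(candidate) and is_sorted(candidate)


numbers = [2, 1, 7, 5, 8, 4, 3]
print(is_valid_sort(numbers, [1, 2, 3, 4, 5, 7, 8]))  # proposed solution from the text
print(is_valid_sort(numbers, [1, 8, 7, 2, 4, 5, 3]))  # the non-solution from the text
print(sorted(numbers))                                # solving, using Python's built-in O(N log N) sort
```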
What is the relationship between computational complexity and our notions of the complexity
associated with entropy, randomness, and predictability? Think of an algorithm as operating on a
set of input data. The algorithm might extract information from the data, as in a program that
solves an equation. Or the algorithm might modify the data, as in sorting a list of numbers. The
successive steps of the algorithm leave a trace as a sequence of intermediate steps. We can
analyse the complexity of the trace, just as we do any other sequences, including genome
sequences.
The theory of computational complexity places general limits on the efficiency of computations,
independent of the nature of the hardware. In principle, these limits constrain cells as much as
they do human programmers. Cells do lots of computations – for instance, how much of each gene
should be transcribed at the moment. Now, the classical theory of computational complexity applies
to traditional computer architectures, in which successive operations are executed one at a time.
The inherent complexity of many biological computations implies that cells could not use this
organization in their calculations. This is an inherent constraint on the design of living
regulatory systems. In fact, many computations within cells operate as parallel processes. These
are not subject to the constraints derived from the classical theory of computational complexity.
Computer scientists are extending the theory of complexity to alternative computer architectures.
The comparison of the constraints imposed by different computer architectures affords insight into
how cells organize different calculations.
The metabolome
Classification and assignment of protein Note that several reactions, involving different
function alcohols, would share this number (whether or not
the same enzyme catalysed them); but that the same
Proteins have a very wide variety of functions. In
dehydrogenation of one of these alcohols by an
systems biology, there are two particular classes
enzyme using the alternative cofactor NADP would
of function that form dynamic networks. (a) The
not. It would be assigned EC 1.1.1.2.
enzymes that run the biochemistry of the cell.
The first field in an EC number indicates to which
(b) Regulatory networks exercise control to provide
of the six main divisions (classes) the enzyme belongs:
stability and robustness.
Class 1. Oxidoreductases
The Enzyme Commission Class 2. Transferases
The first detailed classification of protein functions
Class 3. Hydrolases
was that of the Enzyme Commission (EC). In 1955,
Class 4. Lyases
the General Assembly of the International Union of
Biochemistry (IUB), in consultation with the Interna- Class 5. Isomerases
tional Union of Pure and Applied Chemistry (IUPAC), Class 6. Ligases
established an International Commission on Enzymes,
The significance of the second and third numbers
to systematize nomenclature. The Enzyme Commis-
depends on the class. For oxidoreductases the second
sion published its classification scheme, first on paper
number describes the substrate and the third number
and now on the web: http://www.chem.qmul.ac.uk/
the acceptor. For transferases, the second number
iubmb/enzyme/.
describes the class of item transferred, and the third
EC numbers (looking suspiciously like IP numbers)
number describes either more specifically what they
contain four numeric fields, corresponding to a four-
transfer or in some cases the acceptor. For hydrolases,
level hierarchy. For example, EC 1.1.1.1 corresponds
the second number signifies the kind of bond cleaved
to the reaction:
(e.g. an ester bond) and the third number the molecu-
an alcohol + NAD = the corresponding aldehyde lar context (e.g. a carboxylic ester or a thiolester).
or ketone + NADH2 (Proteinases, a type of hydrolase, are treated slightly
352 11 Systems Biology
differently, with the third number including the possibly in concert with other proteins or RNA
mechanism: serine proteinases, thiol proteinases, and molecules; either a general term such as signal
acid proteinases are classified separately.) For lyases transduction, or a particular one such as cyclic
the second number signifies the kind of bond formed AMP synthesis. This is function from the cell’s
(e.g. C–C or C–O), and the third number the specific point of view.
molecular context. For isomerases, the second number
Because many processes are dependent on location,
indicates the type of reaction and the third number the specific class of reaction. For ligases, the second number indicates the type of bond formed and the third number the type of molecule in which it appears. For example, EC 6.1 for C–O bonds (enzymes acylating tRNA), EC 6.2 for C–S bonds (acyl-CoA derivatives), etc. The fourth number gives the specific enzymatic activity.
The Enzyme Structures Database at PDBe links Enzyme Commission numbers to proteins of known structure (http://www.ebi.ac.uk/thornton-srv/databases/enzymes/).

The Gene Ontology™ Consortium protein function classification

In 1999, Michael Ashburner and many co-workers faced the problem of annotating the soon-to-be-completed Drosophila melanogaster genome sequence. As a classification of function, the EC classification was unsatisfactory, if only because it was limited to enzymes. Ashburner organized the Gene Ontology™ Consortium to produce a standardized scheme for describing function.

• An ontology is a formal set of well-defined terms with well-defined interrelationships; that is, a dictionary and rules of syntax.

The Gene Ontology™ Consortium (http://www.geneontology.org) has produced a systematic classification of gene function, in the form of a dictionary of terms, and their relationships.
Organizing concepts of the Gene Ontology project include three categories:

• Molecular function: a function associated with what an individual protein or RNA molecule does in itself; either a general description such as enzyme, or a specific one such as alcohol dehydrogenase (specifying a catalytic activity, not a protein). This is function from the biochemist’s point of view.
• Biological process: a component of the activities of a living system, mediated by a protein or RNA,
Gene Ontology (GO) also tracks:
• Cellular component: the assignment of site of activity or partners; this can be a general term such as nucleus or a specific one such as ribosome.

Figure 11.1 shows an example of the GO classification.
Neither the EC nor the GO classification is an assignment of function to individual proteins. The EC emphasized that: ‘It is perhaps worth noting, as it has been a matter of long-standing confusion, that enzyme nomenclature is primarily a matter of naming reactions catalysed, not the structures of the proteins that catalyse them’. (See http://www.chem.qmul.ac.uk/iubmb/nomenclature/.)
Assigning EC or GO numbers to proteins is a separate task. Such assignments appear in protein databases such as UniProtKB.

Comparison of Enzyme Commission and Gene Ontology classifications

Enzyme Commission identifiers form a strict four-level hierarchy, or tree. For example, isopentenyl-diphosphate Δ-isomerase is assigned EC number 5.3.3.2. The initial 5 specifies the most general category, 5 = isomerases, 5.3 comprises intramolecular isomerases, 5.3.3 those enzymes that transpose C=C bonds, and the full identifier 5.3.3.2 specifies the particular reaction. In the molecular function ontology, GO assigns the identifier 0004452 to isopentenyl-diphosphate Δ-isomerase. (The numerical GO identifiers themselves have no interpretable significance.)
Figure 11.2 compares the EC and GO classifications of isopentenyl-diphosphate Δ-isomerase. The figure shows a path from GO:0004452 to the root node of the molecular function graph, GO:0003674. In this case there are four intervening nodes, progressively more general categories as we move up the figure. Note that the GO description of this enzyme as an oxidoreductase is inconsistent with the EC classification, in which a committed choice between oxidoreductase and isomerase must be made at the highest level of the EC hierarchy.
The metabolome 353
[Figure 11.1 diagram. Node labels recovered from the figure: Molecular function; Lamin/chromatin binding; ATP-dependent DNA helicase; DNA metabolism; DNA degradation; DNA packaging; DNA replication; DNA repair; DNA recombination; Mitochondrial genome maintenance; DNA-dependent DNA replication.]
Figure 11.1 Selected portions of the three categories of Gene Ontology, showing classifications of functions of proteins that interact
with DNA.
(a) Biological process: DNA metabolism.
(b) Molecular function: including general DNA binding by proteins, and enzymatic manipulations of DNA.
(c) Cellular component: Different places within the cell.
These pictures illustrate the general structure of the Gene Ontology classification. Each term describing a function is a node in a graph.
Each node has one or more parents and may have one or more descendants: arrows indicate direct ancestor–descendant relationships.
A path in the graph is a succession of nodes, each node the parent of the next. Nodes can have ‘grandparents’, and more remote
ancestors.
Unlike the EC hierarchy, the Gene Ontology graphs are not trees in the technical sense, because there can be more than one path
from an ancestor to a descendant. For example, there are two paths in (a) from enzyme to ATP-dependent helicase. Along one path
helicase is the intermediate node. Along the other path adenosine triphosphatase is the intermediate node.
Although the nodes are shown on discrete levels to clarify the structure of the graph, all the nodes on any given level do not
necessarily have a common degree of significance; unlike family, genus, and species levels in the Linnaean taxonomic tree, or the ranks
in military, industrial, academic, etc. organizations. GO terms could not have such a common degree of significance, given that there
can be multiple paths, of different lengths, between different nodes.
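The distinction between a tree and a more general directed acyclic graph can be made concrete by counting paths. The sketch below encodes only the small fragment mentioned in the caption (enzyme, helicase, adenosine triphosphatase, ATP-dependent helicase); the dictionary `GO_FRAGMENT` and the function `count_paths` are hypothetical names for illustration, not part of any GO software.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Parent -> children, for the fragment described in the caption.
GO_FRAGMENT: Dict[str, Tuple[str, ...]] = {
    "enzyme": ("helicase", "adenosine triphosphatase"),
    "helicase": ("ATP-dependent helicase",),
    "adenosine triphosphatase": ("ATP-dependent helicase",),
    "ATP-dependent helicase": (),
}


def count_paths(children: Dict[str, Tuple[str, ...]], ancestor: str, descendant: str) -> int:
    """Count distinct directed paths from ancestor to descendant in an acyclic graph."""

    @lru_cache(maxsize=None)
    def walk(node: str) -> int:
        if node == descendant:
            return 1
        return sum(walk(child) for child in children.get(node, ()))

    return walk(ancestor)


if __name__ == "__main__":
    # More than one path means the graph is not a tree.
    print(count_paths(GO_FRAGMENT, "enzyme", "ATP-dependent helicase"))
```

In a tree this count is at most 1 for every pair of nodes; here it is 2, one path through helicase and one through adenosine triphosphatase.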
354 11 Systems Biology
[Figure residue: node labels Cell, Cytoplasm, Nucleus.]
Table 11.1 Databases of metabolic pathways

Database    Home page
EcoCyc      http://ecocyc.org
BioCyc      http://www.biocyc.org
KEGG        http://www.genome.jp/kegg/
WIT         www.mcs.anl.gov/compbio

other databases that provide different data selections and different modes of organization. EcoCyc deals with E. coli. It is the model for – and linked with – numerous parallel databases, with uniform web interfaces, treating other organisms. BioCyc is the ‘umbrella’ collection. KEGG, the Kyoto Encyclopedia of Genes and Genomes, contains information from multiple organisms. WIT contains metabolic reconstructions derived from genome sequences (Table 11.1).

EcoCyc

EcoCyc is a database representing what we know about the biology of E. coli, strain K-12 MG1655. It contains:

• the genome: the complete sequence, and for each gene its position and function if known;
• transcription regulation: operons, promoters, and transcription factors and their binding sites;
• metabolism: the pathways, including details of the enzymology of individual steps; for each enzyme the reaction, activators, inhibitors, and subunit structure are given;
• membrane transporters: transport proteins and their cargo; and
• links to other databases: protein and nucleic-acid sequence data, literature references, and comparisons to different E. coli strains.

Methionine synthesis in Escherichia coli

A tiny subset of the E. coli metabolic network is the pathway for synthesis of methionine from aspartate (see Figure 11.3).
To appreciate the logic of the system from diagrams such as Figure 11.3, keep in mind that both the reaction sequence and the control cascades are embedded in much larger networks.

• The first step, phosphorylation of L-aspartate, is common to the biosynthesis of methionine, lysine, and threonine. E. coli contains three aspartate kinases, encoded by three separate genes, each specific for one of the end-product amino acids. They catalyse the same reaction but are subject to separate regulation.
• The third step, conversion of L-aspartate-semialdehyde to L-homoserine, is common to the methionine and threonine synthesis pathways. Two homoserine dehydrogenases are separately encoded. Regulation of expression of the aspartate kinases and homoserine dehydrogenases suffices to control all three pathways.
• The piece of the regulatory network is also extracted from a more complex tapestry. For example, CRP (cAMP receptor protein, also called catabolite activator protein) regulates more than 200 genes!
• Methionine is converted to S-adenosylmethionine, a common participant in methyl group transfers. S-Adenosylmethionine activates the Met repressor (encoded by metJ) (see Figure 11.3). This is a more complicated form of feedback. In classic feedback inhibition, a product interacts directly with an enzyme that produces one of its precursors. In this case, the product interacts with a repressor, which reduces the expression of enzymes that produce its precursors. (See page 47.)

In the EcoCyc web page that contains the information corresponding to this figure, the items are active. Links to other internal pages expand information about metabolites, cofactors, enzymes, genes, and regulators. It is possible to ‘zoom’ in or out by controlling the level of detail. For instance, asking for less detail than the contents of Figure 11.3(a) would first eliminate the information about the genes and enzymes and then reduce the pathway to an outline showing only critical intermediates:
• Readers are urged to explore the EcoCyc web site on their own, either deliberately or serendipitously, or guided by weblems in this chapter or at the Online Resource Centre.

It is also possible to explore in other dimensions. The methionine synthesis pathway is embedded in larger networks. One of these involves synthesis of the amino acids lysine and threonine in addition to methionine, all starting with aspartate (see Figure 11.4).

[Figure 11.3 diagram.
(a) Pathway, listed as gene, enzyme, and product of each step, starting from L-Aspartate:
  metL        Aspartate kinase                          L-Aspartate 4-phosphate
  asd         Aspartate semialdehyde dehydrogenase      L-Aspartate-semialdehyde
  metL        Homoserine dehydrogenase II               L-Homoserine
  metA        Homoserine O-succinyltransferase          O-Succinyl-L-homoserine
  metB        O-Succinylhomoserine(thiol)lyase          Cystathionine
  metC/malY   Cystathionine β-lyase                     L-Homocysteine
  metE/metH   L-Homocysteine transmethylase             Methionine
(b) Regulatory network; labels recovered from the figure: SoxP, CRP, MalI, MetP.]

Figure 11.3 (a) The synthesis of methionine from aspartate is a seven-step pathway through a linear sequence of intermediates (black). Different enzymes (green) catalyse different steps. They are encoded by the genes shown in blue. metC and malY encode alternative cystathionine β-lyases. The final step, conversion of L-homocysteine to L-methionine, is also catalysed by two different L-homocysteine transmethylases, encoded by two genes, metE and metH. One mechanism of control is at the protein level: there is ‘feedback inhibition’ by the product, methionine, which inhibits homoserine O-succinyltransferase. This is shown by the red line;
Circles contain the genes for the transcription factors that control expression. Regulated genes appear in rectangles. Molecules that tend to enhance transcription are connected to their targets by green arrows. Molecules that tend to repress transcription are connected to their targets by red lines ending in a ‘T’. In addition to the links shown here, most of the regulatory proteins feed back on themselves; in most cases, the self-regulatory signal is a repression.
Control is exerted on every protein of the pathway, from a variety of points of initiation.

The Kyoto Encyclopedia of Genes and Genomes (KEGG)

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an extremely comprehensive battery of databases for molecular biology and genomics. One of its special strengths is an integration of metabolic and genomic information. KEGG contains pathway maps, which describe potential networks of molecular activities, both metabolic and regulatory.
Figure 11.5 shows a pathway from KEGG, the reductive carboxylate cycle in photosynthetic bacteria. This pathway is basically the Krebs cycle, run backwards.
KEGG derives its power from the very dense network of links among these categories of information, and additional links to many other databases to which the system maintains access. Two examples of the kinds of questions that can be treated with KEGG are:

(a) It has been suggested that simple metabolic pathways evolve into more complex ones by gene duplication and subsequent divergence. Searching the pathway catalogue for sets of enzymes that share a folding pattern will reveal clusters of linked paralogues.
(b) KEGG can take the set of known enzymes from some organism and check whether they can be integrated into established metabolic pathways. A gap in a pathway suggests a missing enzyme or an unexpected alternative pathway.

[Figure 11.4 diagram. Labels recovered from the figure: L-Aspartate; L-Aspartate-semialdehyde; Homoserine; L-Threonine; L-Homocysteine; L-Methionine; L-Lysine.]

Figure 11.4 The pathway of amino acid biosynthesis from aspartate branches after aspartate semialdehyde. In this figure, the black sequence corresponds to the previous example, and the green pathways are the immediate context. The aspartate → methionine sequence is a subnetwork of the network shown here. Each amino acid plays a regulatory role, exerting feedback inhibition over its own synthesis, without affecting the others. It looks as if threonine and lysine both individually inhibit the first step of the synthesis of all three products, but this step is catalysed by three separate aspartate kinases, allowing specialized regulation.

• Several databases assemble biochemical reactions into metabolic pathways. Individual steps are linked to Enzyme Commission and Gene Ontology Consortium classifications of function, and to individual proteins that catalyse the reactions. These databases are useful in organizing the assignment of function to proteins identified in newly sequenced genomes.

Evolution and phylogeny of metabolic pathways

Most organisms share many common metabolic pathways. But there are many individual variations. Some organisms have metabolic competence completely absent from others. Plants but not humans have enzymes for reactions involved in photosynthesis and cell-wall formation.
Some organisms achieve the same overall metabolic transformation but use alternative pathways; that is, different sets of intermediates. For instance, classical glycolysis and the Entner–Doudoroff pathway are alternative routes from glucose to pyruvate, in which there is a whole succession of reactions possible (Figure 11.6). Often, organisms will share many steps in a metabolic transformation but some will extend or truncate the pathway. Humans have lost activity in the last enzyme in vitamin C synthesis, L-gulonolactone oxidase, so we must obtain vitamin C from our diet. Most mammals have a working copy of the gene for this protein, and can synthesize vitamin C. We have an inactive pseudogene.
Another example is the pathway for nitrogen excretion (see Figure 11.7). Organisms with more water available in their immediate surroundings use more of the reactions.
We can represent the metabolic networks of different species as graphs. The nodes are metabolites. There are edges between pairs of metabolites if the organism has an enzyme that will convert one to the other, or if the interconversion is spontaneous. We can then compare the graphs to get a quantitative measure of the divergence. Intuitively, we expect that the divergence in metabolic network should correspond to the divergence between species as measured from comparing genome sequences.
The procedure outlined here deals with a static and binary picture of the metabolic network. Either a transformation is possible, or it is not. It is entirely
[Figure 11.5 diagram. Metabolite labels recovered from the figure: CO2, phosphoenolpyruvate, pyruvate, L-alanine, acetate, acetyl-CoA, citrate, isocitrate, 2-oxoglutarate, succinyl-CoA, succinate, fumarate, reduced ferredoxin. Linked maps: carbon fixation; alanine and aspartate metabolism; sulphur metabolism; glutamate metabolism. EC numbers shown: 2.7.9.2, 1.4.1.1, 4.1.1.31, 1.2.7.1, 6.2.1.1, 2.3.3.8, 4.2.1.2, 4.2.1.3, 1.3.99.1, 1.1.1.42, 6.2.1.5, 1.2.7.3.]
Figure 11.5 Metabolic pathway map from The Kyoto Encyclopedia of Genes and Genomes (KEGG). This figure shows the reductive
carboxylate cycle, and its links to other metabolic processes. The numbers in square boxes are EC numbers identifying the reactions at
each step.
possible that enzymes that catalyse corresponding steps in the network have very different kinetic constants in two species, or are subject to different kinds of regulation. In this case the dynamic patterns of traffic through the network might be quite different, even if the topology of the network is the same. Think of the difference in traffic flow through a city during rush hour, and at midnight. The roads haven’t changed, but the kinetics has.

Carbohydrate metabolism in archaea

The common pathway from glucose to pyruvate in bacteria and eukaryotes is the Embden–Meyerhof glycolytic route (see Figure 11.6).
B. Siebers and P. Schönheit have studied the metabolic pathways of carbohydrate metabolism in archaea. In the initial conversion of glucose to pyruvate, they observed a number of differences in the pathway, from either the standard Embden–Meyerhof glycolytic pathway, or the Entner–Doudoroff alternative. Pyrococcus furiosus, Thermococcus celer, Archaeoglobus fulgidus strain 7324, Desulfurococcus amylolyticus, and Pyrobaculum aerophilum use a modified Embden–Meyerhof pathway (Figure 11.9). Sulfolobus solfataricus and Haloarcula marismortui use a modified Entner–Doudoroff pathway (Figure 11.8). Thermoproteus tenax uses both.
In addition to the differences in the sequence of metabolites, the enzymes that catalyse even the same
reactions are almost always not homologues of bacterial or eukaryotic ones. Many of them use different cofactors. Bacterial and eukaryotic phosphofructokinases (that convert fructose-6-phosphate to fructose-1,6-bisphosphate) use ATP as the phosphoryl donor. The archaeal enzymes that catalyse this reaction can use ATP, ADP, or even inorganic pyrophosphate. In addition, some of the familiar enzymes are under allosteric control. The control relationships are not retained in the corresponding archaeal enzymes.

• Of particular interest for comparative genomics are facilities to compare pathways among different organisms. Alignment and comparison of pathways can expose how pathways have diverged between species. Even if the pathways are the same, in some cases the enzymes are non-homologous.

[Figure 11.6 diagram. Intermediates recovered from the figure, in order:
(a) Embden–Meyerhof                 (b) Entner–Doudoroff
glucose-6-P                         glucose-6-P
fructose-6-P                        6-P-gluconate
fructose-1,6-bisP                   2-keto-3-deoxy-6-P-gluconate
2 (1,3-bisphosphoglycerate)         1,3-bisphosphoglycerate
2 (3-phosphoglycerate)              3-phosphoglycerate
2 (2-phosphoglycerate)              2-phosphoglycerate
2 (phosphoenolpyruvate)             phosphoenolpyruvate]

Figure 11.6 (a) Embden–Meyerhof glycolytic pathway, (b) Entner–Doudoroff pathway. Note that the enzymatic conversion of glyceraldehyde-3-phosphate to pyruvate is the same in both pathways (green branch).

[Figure 11.7 diagram. Uric acid (primates, birds) → (urate oxidase) → Allantoin (some mammals) → (allantoinase) → Allantoate (bony fish) → (allantoicase) → Urea (amphibians) → (urease) → Ammonia (marine invertebrates).]

Figure 11.7 Succession of reactions to produce excreted forms of end products of nitrogen metabolism.
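The graph comparison described under ‘Evolution and phylogeny of metabolic pathways’ (metabolites as nodes, an edge wherever a conversion is possible) can be sketched as follows. The choice of the Jaccard distance between edge sets is an assumption made here for illustration; the text does not name a specific measure. The two edge sets are toy fragments based on the intermediates listed for Figure 11.6, not complete networks of any organism.

```python
from typing import FrozenSet, Tuple

Edge = Tuple[str, str]  # (substrate, product): a conversion the organism can perform


def jaccard_distance(edges_a: FrozenSet[Edge], edges_b: FrozenSet[Edge]) -> float:
    """1 - |shared edges| / |all edges|; 0.0 for identical networks, 1.0 for disjoint ones."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)


# Toy fragment of an Embden-Meyerhof-type network (illustrative only).
EM_FRAGMENT = frozenset({
    ("glucose-6-P", "fructose-6-P"),
    ("fructose-6-P", "fructose-1,6-bisP"),
    ("phosphoenolpyruvate", "pyruvate"),
})

# Toy fragment of an Entner-Doudoroff-type network (illustrative only).
ED_FRAGMENT = frozenset({
    ("glucose-6-P", "6-P-gluconate"),
    ("6-P-gluconate", "2-keto-3-deoxy-6-P-gluconate"),
    ("phosphoenolpyruvate", "pyruvate"),
})

if __name__ == "__main__":
    print(jaccard_distance(EM_FRAGMENT, ED_FRAGMENT))
```

This is the static, binary picture that the text cautions about: an edge is either present or absent, and nothing is said about kinetic constants or regulation.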
[Figure 11.8 diagram. Branched pathway from Glucose through Gluconate to Pyruvate: branch (a) passes through KDG, GA, and Glycerate; branch (b) passes through KDPG, GAP, 1.3 BPG, and 3 PG; both reach 2 PG, PEP, and Pyruvate. Enzyme steps are numbered 1–12 and cofactors (NAD(P)+/NAD(P)H, ADP/ATP, Fdox/Fdred, Pi, H2O) are marked on the arrows; see the caption for the key.]
Figure 11.8 Modifications of the Entner–Doudoroff (ED) pathway in Archaea. (a) The non-phosphorylative ED pathway in
Thermoplasma acidophilum. (b) The semi-phosphorylative ED pathway in halophilic Archaea. A branched ED (combining (a) and (b))
appears in S. solfataricus and T. tenax. Abbreviations: 1.3 BPG, 1,3-bisphosphoglycerate; Fdox and Fdred, oxidized and reduced
ferredoxin; GA, glyceraldehyde; GAP, glyceraldehyde-3-phosphate; KDG, 2-keto-3-deoxygluconate; KDPG, 2-keto-3-deoxy-6-
phosphogluconate; PEP, phosphoenolpyruvate; 2 PG, 2-phosphoglycerate; 3 PG, 3-phosphoglycerate. Enzymes are numbered as
follows: 1, glucose dehydrogenase; 2, gluconate dehydratase; 3, KD(P)G aldolase; 4, glyceraldehyde dehydrogenase (proposed for
T. acidophilum), glyceraldehyde:ferredoxin oxidoreductase (proposed for T. tenax) or glyceraldehyde oxidoreductase (proposed
for S. acidocaldarius); 5, glycerate kinase; 6, enolase; 7, pyruvate kinase; 8, KDG kinase; 9, GAPDH; 10, phosphoglycerate kinase; 11,
GAPN; 12, phosphoglycerate mutase.
From: Siebers, B. & Schönheit, P. (2005). Unusual pathways and enzymes of central carbohydrate metabolism in Archaea. Curr. Opin. Micro. 8, 695–705.
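[Figure 11.9 diagram. Columns for Pyrococcus, Thermococcus, Archaeoglobus 7324, Desulfurococcus, Pyrobaculum, and Thermoproteus. Pathway labels recovered from the figure: Glucose, GLK, G-6-P, DHAP, GAP, 3-PG, PGM, 2-PG, Enolase, PEP, PK, Pyruvate. Cofactor pairs at the GLK step: ADP/AMP or ATP/ADP, depending on the organism; at the PK step: ADP/ATP.]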
Figure 11.9 Modifications of the Embden–Meyerhof (EM) pathway in Archaea. In this case most of the reactions are the same. The enzymes are not homologous to those that catalyse the corresponding reactions in bacteria and eukarya. Note the differences in cofactors. The steps and mechanisms of regulation also differ. Abbreviations: aFBA, archaeal class I FBA; cPGI, cupin PGI; DHAP, dihydroxyacetone phosphate; FBA, fructose 1,6-bisphosphate aldolase; F-1,6-BP, fructose 1,6-bisphosphate; Fdox and Fdred, oxidized and reduced ferredoxin; F-6-P, fructose-6-phosphate; GAP, glyceraldehyde-3-phosphate; GAPN, non-phosphorylative glyceraldehyde 3-phosphate dehydrogenase; GAPOR, glyceraldehyde-3-phosphate ferredoxin oxidoreductase; GLK, glucokinase (ADP- or ATP-dependent); G-6-P, glucose-6-phosphate; PEP, phosphoenolpyruvate; PFK, 6-phosphofructokinase; 2-PG, 2-phosphoglycerate; 3-PG, 3-phosphoglycerate; PGI/PMI, bifunctional phosphoglucose/phosphomannose isomerase; PGI, phosphoglucose isomerase; PGM, phosphoglycerate mutase; PK, pyruvate kinase; TIM, triosephosphate isomerase.
From: Siebers, B. & Schönheit, P. (2005). Unusual pathways and enzymes of central carbohydrate metabolism in Archaea. Curr. Opin. Micro. 8, 695–705.
for a shikimate kinase in A. pernix and to identify a homologue of that gene in M. jannaschii. Experiments confirmed the prediction that the M. jannaschii gene thus identified (MJ1440) encoded a shikimate kinase. It has no sequence similarity to bacterial or eukaryotic shikimate kinases. A protein from a different family has been recruited for the archaeal pathway.
Regulatory networks
Regulatory networks pervade living processes. Control interactions are organized into linear signal transduction cascades and reticulated into control networks.
Any individual regulatory action requires (1) a stimulus; (2) transmission of a signal to a target; (3) a response; and (4) a ‘reset’ mechanism to restore the resting state (see Figure 11.10). Many regulatory actions are mediated by protein–protein complexes. Transient complexes are common in regulation, as dissociation provides a natural reset mechanism.
Some stimuli arise from genetic programmes. Some regulatory events are responses to current internal
An unlabelled, undirected graph gives a static picture of the topology of a network. The dynamic states are more complex (see Box 11.7), including:

• equilibrium;
• steady state;
• states that vary periodically;
• unfolding of developmental programmes;
• chaotic states;
• runaway or divergence; and
• shutdown.

Although much is known about the mechanisms of individual elements of control and signalling pathways, understanding their integration is a subject of current research. For instance, the idea that healthy cells and organisms are in stable states is certainly no more than an approximation (in most cases, it is an idealization).
Understanding how cells achieve even an apparent approximation to stability is also quite tricky. It is likely that great redundancy of control processes lies at its basis. Regulation is based on the result of many individual control mechanisms – here a short feedback loop, there a multistep cascade. Somehow the independent actions of all of the individual signals combine to achieve an overall, integrated result. It is like the operation of the ‘invisible hand’ that, according to Adam Smith, coordinates individual behaviour into the regulation of national economies.

• Robustness is more than stability. Stability is keeping your composure in unchanging conditions. Robustness is keeping your composure in changing conditions.

Robustness through redundancy

In principle, networks can achieve robustness through an extension of the mechanism by which redundancy confers stability. The most direct approach is simple substitutional redundancy: if two proteins are each capable of doing a job, knock out one and the other takes over. In the London Underground, this would correspond to a second line running over the same route. For instance, when the Circle Line is not running, passengers travelling between Paddington and King’s Cross stations can use the Hammersmith & City line that runs on the same tracks. In yeast, for example, single-gene knockouts of over 80% of the ∼6200 open reading frames are survivable injuries.
• At equilibrium, one or more forward and reverse processes occur at compensating rates, to leave the amounts of different substances unchanged:

A ⇌ B

Chemical equilibria are generally self-adjusting upon changes in conditions or in concentrations of reactants or products.
• A steady state will exist if the total rate of processes that produce a substance is the same as the total rate of processes that consume it. For instance, the two-step conversion

A → B → C

could maintain the amount of B constant, provided that the rate of production of B (the process A → B) is the same as the rate of its consumption (the process B → C). The net effect would be to convert A to C.
A cyclic process could maintain a steady state in all of its components:

A → B → C → A

A steady state in such a cyclic process with all reactions proceeding in one direction is very different from an equilibrium state. Nevertheless, in some cases, it is still true that altering external conditions produces a shift to another, neighbouring steady state.
• States that vary periodically appear in the regulation of the cell cycle, circadian rhythms, and seasonal changes such as annual patterns of breeding in animals and flowering in plants. Circadian and seasonal cycles have their origins in the regular progressions of the day and year, but have evolved a certain degree of internalization.
• Many equilibrium and some steady-state conditions are stable, in the sense that concentrations of most metabolites are changing slowly if at all and the system is robust to small changes in external conditions. The alternative is a chaotic state, in which small changes in conditions can cause very large responses. Weather is a chaotic system: the meteorologist E. Lorenz asked, ‘Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?’ In a carefully regulated system, chaos is usually well worth avoiding, and it is likely that life has evolved to damp down the responses to the kinds of fluctuation that might give rise to it. Chaotic dynamics does sometimes produce approximations to stable states – these are called strange attractors. Understanding stability in dynamic systems subject to changing environmental stimuli is important but is beyond the scope of this book.
• Unfolding of developmental programmes occurs over the course of the lifetime of the cell or organism. Many developmental events are relatively independent of external conditions and are controlled primarily by regulation of gene expression patterns.
• Runaway or divergence. Absence of a predator can lead to uncontrolled multiplication of a species. An example is the growth of the rabbit population of Australia from an ‘inoculum’ of 24 animals in 1859. Breakdown in control over cellular proliferation leads to unconstrained growth in cancer.
• Shutdown is part of the picture. Apoptosis is the programmed death of a cell, as part of normal developmental processes or in response to damage that could threaten the organism, such as DNA strand breaks. Breakdown of mechanisms of apoptosis – for instance, mutations in the protein p53 – is an important cause of cancer.
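The cyclic steady state described in the box can be checked numerically. The sketch below assumes first-order kinetics for each step of A → B → C → A and uses simple Euler integration; the rate constants, step size, and function names are illustrative choices, not values from the text. At the steady state every step carries the same non-zero flux, which is what distinguishes it from an equilibrium.

```python
from typing import Tuple


def simulate_cycle(k1: float, k2: float, k3: float,
                   a0: float, b0: float, c0: float,
                   dt: float = 0.001, steps: int = 20000) -> Tuple[float, float, float]:
    """Euler integration of A -> B -> C -> A with first-order rate constants k1, k2, k3."""
    a, b, c = a0, b0, c0
    for _ in range(steps):
        flux_ab = k1 * a
        flux_bc = k2 * b
        flux_ca = k3 * c
        # Each pool changes by (rate of production) - (rate of consumption).
        a += dt * (flux_ca - flux_ab)
        b += dt * (flux_ab - flux_bc)
        c += dt * (flux_bc - flux_ca)
    return a, b, c


def analytic_steady_state(k1: float, k2: float, k3: float,
                          total: float) -> Tuple[float, float, float]:
    """Steady state: k1*A = k2*B = k3*C, with A + B + C fixed at `total`."""
    weights = (1.0 / k1, 1.0 / k2, 1.0 / k3)
    scale = total / sum(weights)
    return tuple(w * scale for w in weights)


if __name__ == "__main__":
    simulated = simulate_cycle(1.0, 2.0, 4.0, a0=7.0, b0=0.0, c0=0.0)
    expected = analytic_steady_state(1.0, 2.0, 4.0, total=7.0)
    print(simulated)
    print(expected)
```

With these illustrative constants the pools settle near A = 4, B = 2, C = 1, and the flux through each step is 4 units per unit time; the slowest step holds the largest pool.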
Some duplicated genes contribute to substitutional redundancy. For example, in studying models for diabetes it appears that mice and rats (but not humans) have two similar but non-allelic insulin genes. Substitutional redundancy requires equivalence not only of function but of expression levels. In the mouse, knocking out either insulin gene leads to compensatory increased expression of the other, producing a normal phenotype.
Coordinated expression patterns, providing substitutional redundancy, are more probable among duplicated genes than among unrelated ones. For example, Escherichia coli contains two fructose-1,6-bisphosphate aldolases. One, expressed only in the presence of special nutrients, is non-essential under normal growth conditions. However, the other is essential. In this case, functional redundancy does not provide robustness. These two enzymes are probably
Dynamics, stability, and robustness 365
homologous, but they are distant relatives, not the product of a recent gene duplication. One is a member of a family of fructose-1,6-bisphosphate aldolases typical of bacteria and eukaryotes, whereas the other is a member of another family that occurs in archaea. E. coli is unusual in containing both.
An alternative mechanism of network robustness is distributed redundancy: equivalent effects achieved through different routes. In normal E. coli, approximately two-thirds of the NADPH produced in metabolism arises via the pentose phosphate shunt, which requires the enzyme glucose-6-phosphate dehydrogenase. Knocking out the gene for this enzyme leads to metabolic shifts, after which increased levels of NADH produced by the tricarboxylic acid cycle are converted to NADPH by a transhydrogenase reaction. The growth rate of the knockout strain is comparable to that of the parent.

Dynamic modelling

Diagrams such as Figure 11.3 give a static picture of the structure of a metabolic pathway and its control. Can we model the dynamics? What would it mean to do so?
A challenge that might – naively – appear relatively simple would be to predict the effect of knocking out an enzyme. An easy guess would be to expect a build-up of the substrate of the missing enzyme. However, if the metabolic pathways branch in the vicinity of that metabolite, the consequences of a knockout are more complex.
For example, the disease phenylketonuria results most commonly from a specific dysfunctional (i.e. knocked-out) enzyme, phenylalanine hydroxylase. The normal function of phenylalanine hydroxylase is to convert phenylalanine to tyrosine (see page 60). In phenylketonuria, phenylalanine does indeed build up. However, the excess phenylalanine is converted by phenylalanine transaminase to phenylpyruvic acid:

[Structural formulae: phenylalanine (amino group, NH2) → phenylpyruvic acid (keto group, O).]

the disease its name. (Phenylalanine is not a ketone.) The Guthrie test for phenylketonuria measures the concentration of phenylalanine in the blood of newborns.
A challenge greater than predicting the effect of a single knockout would be to simulate the entire metabolic network: given an initial set of metabolite concentrations, to predict the concentrations as a function of time. The idea would be to combine predictions of the rates of individual reactions, assuming a simple model such as Michaelis–Menten kinetics, or more complex models of allosteric enzymes. This requires knowing accurately the kinetic constants of all of the enzymes, including effects of inhibitors. It requires being able to give a sensible treatment of the idea of ‘substrate concentration’ within a cell divided into compartments and to deal with questions of rates of diffusion in a crowded intracellular environment. Longer-term simulation would require knowing the kinetics of transcription regulation, for which no simple model analogous to the Michaelis–Menten equation is available. There are also serious computational issues involving how precisely the kinetic parameters must be known, and the extent to which simplifying assumptions – for instance, the steady-state approximation – are justified.
Accurate simulation of metabolic patterns of entire cells is a clear target for research in the field. However, the problem is a difficult one. Current approaches include:

• Attempts at detailed numerical analysis of simple networks. For instance, a simulation of the aspartate → threonine pathway (see Figure 11.4) in E. coli represented the enzymatic transformations and feedback inhibition as a set of coupled equations.* Changes in expression pattern were not included. Steady-state solutions were compared with experimental measurements on cell extracts. It was possible to:
– simulate the time course of threonine synthesis and the effects of changes in initial metabolite concentrations;
– predict the steady-state concentrations of intermediates;
– predict the effects of changes in concentrations of individual enzymes on overall throughput, expressed as flux control coefficients; such data can help to guide development of microbial factories for increased yield of particular products;
– for different steps, distinguish whether the substrates and products are approximately at equilibrium.

• The flux control coefficient is the percentage change in flux divided by the percentage change in amount of enzyme. It is not a property of the enzyme, but a property of a reaction within a metabolic network. A flux control coefficient equal to 1 would correspond to a rate-limiting step.

• Focusing not on individual enzymes but on potential sets of flow rates. Represent the metabolic network as a graph. Metabolites are the nodes. Edges correspond to reactions: an edge connects two compounds if there is a reaction, or possibly several reactions, that interconvert them. The goal is to predict the flow rate through each edge. Recently the models have been generalized to include regulation of expression. There are general constraints on the set of flow rates:
– under steady-state conditions, the fluxes through each node must add up to zero; i.e. for each compound, the amount that is synthesized or supplied externally must equal the amount used up or secreted;
– the flux control coefficients of all of the reactions contributing to a single flux must add up to 1;
– the flux through any edge is limited by the values of the Michaelis–Menten parameter Vmax for all
– the thermodynamic properties of each reaction determine whether or not the reaction is reversible: this is a property of the substrate and product of the reaction, not of the enzyme; the flux of an irreversible reaction must be ≥0.

• It is interesting to see whether the space of possible metabolic states is connected or broken up into separated regimes.

In general, many possible flow patterns, or metabolic states, are consistent with the constraints. To determine a single metabolic state to compare with experiments, it is possible to select from the feasible states the one that is optimal for ATP production or for growth rate.
A variety of observable quantities are predictable.

• The effects of changes of medium or gene knockouts: which enzymes are essential for growth on different carbon sources?
• What are limiting factors in growth?
• What are maximal theoretical yields of ATP, or assimilation of carbon, etc.?
• What are the fluxes through individual pathways? This is difficult but not impossible to measure.
• What are the flux control coefficients of different enzymes?
• For optimal growth, how much oxygen and carbon source are taken up?

Such models have been constructed for several organisms, including prokaryotes and eukaryotes. Predictions have generally achieved good agreement with
enzymes contributing to the edge; and experiments.
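The definition of the flux control coefficient, and the constraint that the coefficients contributing to a single flux add up to 1, can be illustrated with a minimal sketch. The two-step pathway and its rate laws below are assumptions chosen for simplicity (they are not the aspartate → threonine model), and the function names are hypothetical.

```python
def pathway_flux(e1, e2, x0=1.0):
    """Steady-state flux through a toy pathway X0 -> S -> product.

    Assumed rate laws: v1 = e1 * (x0 - s) and v2 = e2 * s.
    Setting v1 = v2 gives s = e1 * x0 / (e1 + e2), hence the flux below.
    """
    return e1 * e2 * x0 / (e1 + e2)


def flux_control_coefficient(enzymes, index, rel_step=1e-6):
    """Fractional change in flux per fractional change in one enzyme amount."""
    base = pathway_flux(*enzymes)
    perturbed = list(enzymes)
    perturbed[index] *= 1.0 + rel_step  # small relative increase in one enzyme
    return (pathway_flux(*perturbed) - base) / base / rel_step


enzymes = (1.0, 3.0)  # arbitrary enzyme amounts for illustration
coefficients = [flux_control_coefficient(enzymes, i) for i in range(len(enzymes))]
print(coefficients, sum(coefficients))
```

With these assumed amounts the scarcer enzyme carries most of the control (about 0.75 against 0.25), and the two coefficients sum to approximately 1. Neither step is fully rate-limiting, which would require a coefficient of 1.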
The units from which interaction networks are assembled are:
• for physical networks, a protein–protein or protein–nucleic acid complex;
• for logical networks, a dynamic connection in which the activity of a process is affected by a change in external conditions, or by the activity of another process.
Most experiments reveal only pairwise interactions. The challenges are to integrate pairwise interactions into a network and then to study the structure and dynamics of the system.
Figure 11.11 (a) X-ray image of the microtubule network of a mouse epithelial cell labelled using metal-conjugated antibodies (to form individual particles ∼50 nm in diameter). Different regions of the cell have measurably different X-ray absorbances. Colouring the image according to X-ray absorbance brings out contrasts; of course, there is no suggestion that the colours correspond to a realistic interaction with visible light. Here the microtubule network appears in blue, and the nucleus and nucleoli in orange. The total width of the field is 120 µm. (b) Cryo X-ray tomography of a yeast cell (S. cerevisiae). This image shows a 0.5 µm section at 60 nm resolution. Lipid droplets are coloured white, the vacuole and nucleus are red. The arrow points to the nucleus. Other cytoplasmic structures appear green and orange. Cell diameter, 5 µm.
(a) From: Meyer-Ilse, W., et al. (2001). High resolution protein localization using soft X-ray microscopy. J. Microsc. 201, 395–403.
(b) From: Larabell, C.A. & Le Gros, M.A. (2004). X-ray tomography generates 3-D reconstructions of the yeast, Saccharomyces cerevisiae, at 60-nm resolution. Mol. Biol. Cell 15, 957–962.
• Coimmunoprecipitation. An antibody raised to a ‘bait’ protein binds the bait together with any other ‘prey’ proteins that interact with it. The interacting proteins can be purified and analysed, for instance by western blotting, or mass spectrometry.
• Chromatin immunoprecipitation identifies DNA sequences that bind proteins (Figure 11.13).
• Phage display. Genes for a large number of proteins are individually fused to the gene for a phage coat protein, to create a population of phage each of which carries copies of one of the extra proteins exposed on its surface. Affinity purification against an immobilized ‘bait’ protein selects phage displaying potential ‘prey’ proteins. DNA extracted from the interacting phages reveals the amino acid sequences of these proteins.
• Surface plasmon resonance analyses the reflection of light from a gold surface to which a protein has been attached. The signal changes if a ligand binds
Figure 11.14 The construct and protocol in the tandem affinity purification (TAP) method for purification of complexes with a selected
‘bait’ protein. The fusion protein containing the bait and the two tags separated by the TEV cleavage site binds in vivo to proteins in the
cell. A first affinity purification step binds the bait protein to a column containing IgG-coated beads that bind specifically to the first tag,
protein A. After thorough washing, cleavage by TEV protease releases the bound complexes, and exposes the second tag. A second
affinity purification step binds the bait protein to a column containing calmodulin-coated beads, which bind specifically to the second
tag, the calmodulin-binding peptide. After washing, elution with the chelating agent ethylene glycol tetraacetic acid (EGTA) releases the
purified complexes.
response to isoniazid. AHPC acts to relieve oxidative stress. There is no evidence that it physically interacts with the Fatty Acid Synthesis complex, or that it mediates a metabolic transformation coupled to fatty acid synthesis. It is a second, independent, component of the response to isoniazid.
• Phylogenetic distribution patterns. The phylogenetic profile of a protein is the set of organisms in which it and its homologues appear. Proteins in a common structural complex or pathway are functionally linked and expected to co-evolve. Therefore proteins that share a phylogenetic profile are likely to have a functional link, or at least to have a common subcellular origin. There need be no sequence or structural similarity between the proteins that share a phylogenetic distribution pattern. A welcome feature of this method is that it derives information about the function of a protein from its relationship to nonhomologous proteins.
Each of these methods provides a basis for a protein interaction network. The networks formed by combining each set of interactions are different, although they overlap, to a greater or lesser extent. They give different views of the kinds of relationships between proteins that exist in cells. It is possible to form a more comprehensive network by combining different types of interactions. For instance, the DIP database is a curated collection of experimentally determined protein–protein interactions (see http://dip.doe-mbi.ucla.edu/). It contains data about 71 276 interactions between 23 201 proteins from 372 organisms.
A limitation that remains is the difficulty of determining structures of transient complexes, or of systems showing substantial conformational changes upon assembly. The situation is shared with much of current molecular biology: we are coming to grips with static structures but are awaiting the development of methods for treating the dynamics.
Structural biology of regulatory networks
Many molecules involved in regulation are multidomain proteins. Each domain in a multidomain protein is relatively free to interact with other molecules. An interaction domain is a part of a protein that confers specificity in ligation of a partner. Regulatory proteins contain a limited number of types of interaction domains, which have diverged to form large families with different individual specificities. For instance, the human genome contains 115 SH2 domains, and 253 SH3 domains. (Src-Homology domains SH2 and SH3 are named for their homologies to domains of the src family of cytoplasmic tyrosine kinases.) Many individual interaction domains even interact with different partners as they participate in successive steps of a control cascade. Initial interactions may also trigger recruitment of additional proteins to form large regulatory complexes.
Figure 11.15 shows types of interaction domain complexes with ligands, including binding of peptides (which may be attached to proteins), and protein–protein complexes. Protein–nucleic acid complexes will appear next.
Many interaction domains are sensitive to the state of post-translational modification of their ligands, for instance binding preferentially to states of a ligand in which specific tyrosines, serines, or threonines are phosphorylated. These and other post-translational modifications function as switches, turning on or interrupting/resetting a signalling cascade.
Protein–protein complex formation allows a cell to detect a signal molecule in the external medium and report its arrival to the cell interior, without the signal molecule itself ever needing to enter the cell. Many receptors use an ingenious dimerization mechanism. The receptor has external, transmembrane, and internal segments. An external ligand binds to two molecules of receptor (see Figure 10.26). The juxtaposition of the external portions also brings the internal portions together, because they are tethered to the external regions by the transmembrane segments. Interaction between the interior segments triggers a conformational change that activates a process such as phosphorylation of a protein. This may initiate a signal transduction cascade that can transmit and amplify the original stimulus (see Box 11.8). Within the cell, ligand-induced dimerization may activate DNA-binding domains (Figure 11.16).
Large-scale protein interaction networks are built up from many individual interactions. Figure 11.17 shows a portion of an interaction network of yeast proteins, based on sets of proteins that have been found together in solved structures.
L-Dopa is used to treat Parkinson’s disease.
[Figure 11.17(b): network diagram whose nodes are protein families, including ATP_synt, COX1, COX2, COX3, ras, ubiquitin, Skp1, F_box, cyclin, pkinase, RNA_pol_A, TFIIS, and Cytochrome_C1.]
Figure 11.17 Portion of the interaction network of yeast proteins: (a) describes the interactions of individual proteins, and (b) shows the interactions within a subnetwork based on
representations of different protein families, in different functional categories, linked in (a). This figure is based on structural data and modelling. Each relationship implies a physical
interaction between the proteins. Some of the interactions involve stable complexes (for instance, RNA polymerase II); others involve transient complexes.
From: Aloy, P. & Russell, R. (2005). Structure-based systems biology: a zoom lens for the cell. FEBS Lett. 579, 1854–1858.
Protein–DNA interactions
• DNA packaging, including nucleosomes and viral capsids.
Different processes require different degrees of DNA-sequence specificity (see Box 11.9).
[Box 11.9]
• Some DNA-binding proteins are relatively non-specific with respect to nucleotide sequence, including DNA replication enzymes and histones.
• Some, for instance EcoRV, bind to DNA with low specificity but cleave only at GATATC. This combination permits a mechanism of finding the target sequence by initial non-specific binding followed by diffusion in one dimension along the DNA.
• Some recognize specific nucleotide sequences. For example, the EcoRI restriction endonuclease binds specifically to GAATTC sequences with almost absolute specificity. It is a homodimer that recognizes palindromic sequences.
• Some DNA-binding proteins recognize consensus sequences. For example, the phage Mu transposase and repressor proteins bind 11 bp sequences of the form CTTT[A/T]PyNPu[A/T]A[A/T] (where [A/T] = A or T, Py = either pyrimidine (C or T), Pu = either purine (A or G), and N = any of the four bases).
• Some recognize nucleotide sequences indirectly, via modulations of local DNA structure. For example, the TATA box-binding protein takes advantage of the greater flexibility of AT-rich sequences to form complexes in which the DNA is very strongly bent (see Figure 11.23). The distinction between sequence specificity achieved through direct interaction with bases and specificity through recognition of local structure has been termed ‘digital versus analogue readout’.
• Some recognize general structural features of DNA, such as mismatched bases or supercoiling.
• Some DNA-binding proteins form an initial complex with high DNA-sequence specificity, followed by recruitment of other proteins of low specificity to enhance overall binding affinity or create a functional complex.
Structural themes in protein–DNA binding and sequence recognition
What does a protein looking at a stretch of DNA in the standard B conformation see? (See Figure 3.12.) What could it hope to grab hold of? Prominent general features are the sugar–phosphate backbone, including charged phosphates suitable for salt bridges and potential hydrogen-bond partners in the sugar hydroxyl groups. Contact with the bases is accessible through the major and minor grooves, although unless the DNA is distorted the bases are visible only ‘edge on’. Hydrogen-bonding patterns between bases in the grooves and particular amino acids account for some of the DNA-sequence specificity in binding. However, many protein–DNA hydrogen bonds are mediated by intervening water molecules, an effect that tends to reduce the specificity.
The idea that an α-helix has the right size and shape to fit into the major groove of DNA was noted in the 1950s. The structures of the first protein–DNA complexes confirmed this prediction. It became the paradigm for protein–DNA interactions. Indeed, when a student solving the structure of the Met repressor–DNA complex told his supervisor that, in the electron-density map he was interpreting, it looked as if a β-sheet were binding in the major groove, he was advised, with patience strongly tinged with condescension, to go back and look for the helix.
We now recognize great structural variety in DNA–protein interactions. A few examples include:
Zinc fingers
Zinc fingers are small modules found in eukaryotic
transcription regulators. Each finger recognizes a
triplet of bases in DNA. Tandem arrays of fingers
recognize an extended region (see Figure 11.21).
Understanding the relationship between the amino
acid sequences of individual zinc fingers and the
DNA sequences they bind would permit modular
design of gene-specific repressors, by assembling a
sequence of fingers.
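Two of the sequence-recognition ideas in this section lend themselves to a short sketch: the palindromic sites recognized by homodimeric restriction enzymes, and consensus sequences with degenerate positions. The helper names and the test DNA string below are made up for illustration. The regular expression is a direct transcription of the phage Mu pattern CTTT[A/T]PyNPu[A/T]A[A/T].

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic if it equals its own reverse complement."""
    return site == reverse_complement(site)


# Phage Mu consensus: [A/T] = A or T, Py = C or T, N = any base, Pu = A or G.
MU_CONSENSUS = re.compile(r"CTTT[AT][CT][ACGT][AG][AT]A[AT]")


def find_sites(pattern, dna):
    """Start positions (0-based) of non-overlapping matches on the given strand."""
    return [match.start() for match in pattern.finditer(dna)]


print(is_palindromic("GAATTC"), is_palindromic("GATATC"))  # EcoRI and EcoRV sites
dna = "GGCTTTACGGAATCC"  # invented sequence, for illustration only
print(find_sites(MU_CONSENSUS, dna))
```

Both restriction sites read the same on either strand, which fits recognition by a symmetric homodimer. The scan covers one strand only; a fuller search would also scan the reverse complement.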
Figure 11.23 The TATA box-binding protein [1YTB]. The obvious feature of this complex is the very strong bending and unwinding induced in the DNA. A long curved β-sheet sits against an unusually flat surface on the DNA, the result of prying open of the minor groove. Phe sidechains intercalate between the bases.
Figure 11.24 The structure of the DNA-binding subunit of p53 shows a double-β-sheet fold [1TSR]. A helix sits in the major groove and sidechains from loops connecting strands of the β-sheet insert into the minor groove.
example of initial binding of a protein to DNA followed by recruitment of other proteins to form an active complex.
p53
p53 is a transcriptional activator and a tumour suppressor (see Figure 11.24). It is of great clinical importance because mutations in the gene for p53 are very common in tumours.
p53 acts by surveilling genome integrity. Damage to DNA induces enhanced expression of p53, which stalls cell-cycle progression. This gives time for DNA repair; if repair is unsuccessful, the ‘fail-safe’ mechanism is apoptosis.
Gene regulation
Cells regulate the expression patterns of their genes. They sense internal cues to maintain metabolic stability, and external cues to respond to changes in the surroundings. The point of contact between genome and expression is the binding of RNA polymerase to promoter sequences, upstream of genes, to initiate transcription. This sensitive point is a juicy target for regulatory interactions (see Box 11.10).
The transcriptional regulatory network of Escherichia coli
Investigation of the mechanism of transcription regulation began with the work of F. Jacob and J. Monod on the lac operon in E. coli. The field has burgeoned, with comprehensive studies of the coli regulatory network, together with work on other organisms,
Figure 11.25 The E. coli transcriptional regulatory network represented as a directed graph. Colour-coding of nodes: transcription
factors are shown as blue squares; regulated operons are shown as red circles. Colour-coding of links: activators, blue; repressors, green;
indeterminate, brown.
From: Dobrin, R., Beg, Q.K., Barabási, A.L., & Oltvai, Z.N. (2004). Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 5, 10.
The fork, also called the single-input motif, transmits a single incoming signal to two outputs. Successive forks, or forks with higher branching degrees, are an effective way to activate large sets of genes from a single impulse. Generalizations of the binary fork include more downstream genes under common control (more tines to the fork), and auto-regulation of the control node. Forks can achieve general mobilization. Moreover, if the regulatory genes have different thresholds for activation, the dynamics of building up the signal can produce a temporal pattern of successive initiation of the expression of different genes.
The scatter configuration, also called the multiple input motif, can function as a logical ‘or’ operation: both downstream targets become active if either of the input impulses is active. Generalizations of the square scatter pattern shown may contain different numbers of nodes on both layers. Note that scatter patterns are superpositions of forks.
The ‘one-two punch’, also called the ‘feed-forward loop’, affects the output both directly through the vertical link; and indirectly and subsequently, through the intermediate link.
This motif can show interesting temporal behaviour if activation of the target requires simultaneous input from both direct and indirect paths (logical ‘and’). Because build-up of the intermediate requires time, the direct signal will arrive before the indirect one. Therefore a short pulsed input to the complex will not activate the output – by the time the indirect signal builds up, the direct signal is no longer active. The system can thereby filter out transient stimuli in noisy inputs (Figure 11.26). Conversely, the active state of the system can shut down quickly upon withdrawal of the external trigger.
* Shen-Orr, S.S., Milo, R., Mangan, S., & Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68.
[Figure 11.26 panels: (a) constant input to node 1: when outputs from 1 and 2 simultaneously exceed threshold, node 3 fires; (b) pulse input to node 1: outputs from 1 and 2 never simultaneously exceed threshold, therefore node 3 never fires.]
Figure 11.26 A ‘one-two punch’, or feed-forward loop, equipped with suitable AND logic at the downstream node, can filter out
transient noise. (a) Constant input; (b) pulse input. The mechanism of signal transmission is the synthesis of a stimulatory molecule
by an activated node. To avoid ‘locking the signal on’, this molecule must subsequently be removed. The effect described here
depends on the time course of build-up and decay of the signal.
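The noise-filtering behaviour of a feed-forward loop with AND logic can be sketched as a small discrete-time simulation. This is a toy model under stated assumptions: the intermediate builds up or decays by one arbitrary unit per time step, and the threshold and ceiling values are invented. None of these parameters come from measurements, and the function name is hypothetical.

```python
def simulate_feed_forward(inputs, threshold=3, max_level=5):
    """Return the on/off state of the output node at each time step.

    inputs: sequence of 0/1 values for the upstream signal (node 1).
    The intermediate (node 2) gains one unit per step while the signal is on
    and loses one unit per step while it is off (assumed kinetics).
    The output (node 3) fires only if the direct signal is on AND the
    intermediate has reached the threshold.
    """
    level = 0
    outputs = []
    for signal_on in inputs:
        if signal_on:
            level = min(max_level, level + 1)  # build-up of the intermediate
        else:
            level = max(0, level - 1)  # decay, so the signal is not locked on
        outputs.append(bool(signal_on) and level >= threshold)
    return outputs


short_pulse = [1, 1, 0, 0, 0, 0, 0, 0]
sustained = [1, 1, 1, 1, 1, 1, 0, 0]
print(simulate_feed_forward(short_pulse))
print(simulate_feed_forward(sustained))
```

In this sketch the two-step pulse never activates the output, whereas the sustained input does so after a delay. The output also switches off in the same step in which the input is withdrawn, because the AND condition fails as soon as the direct signal disappears.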
The dynamic properties of the network are also of interest. These include both the response of the network to changing conditions, as in the lac operon, and the comparison of regulatory networks in related organisms to understand how networks evolve.
Even a constant network can produce different outputs from different inputs. However, even within an organism, networks can change their structure in response to changes in conditions. This can affect even some of the hubs of the network, the points at which changes have the most far-reaching effects. This has been examined most closely in yeast (see p. 383).
Similarities and differences in the regulatory interactions in related organisms illuminate how their networks evolve. The evolutionary retention of transcription factors is smaller than that of target genes. Even transcription factors that serve as hubs are not more highly conserved. Different organisms are relatively free to explore different regulatory pathways, even to regulate orthologous genes. This may well be the ‘other side of the coin’ of the redundancy in the networks that provides robustness.
There is evidence that larger changes in regulatory networks reflect changes in lifestyle. Organisms with similar lifestyle – several species of soil bacteria, genera Bacillus, Corynebacterium, and Mycobacterium – conserve regulatory interactions, as do intracellular parasites Mycoplasma, Rickettsiae, and Chlamydiae.
[Figure 11.27 panels: (a) promoter region with CAP site and operator upstream of lacZ, lacY, lacA: low level of transcription; (b) CAP and RNA polymerase bound: high level of transcription; (c) repressor (rep) bound: no transcription.]
Figure 11.27 States of the lactose operon. (a) The promoter region contains regulatory sites upstream of the protein-encoding genes lacZ, lacY, and lacA. (b) Binding of CAP to its upstream site within the promoter enhances the binding affinity of RNA polymerase, turning transcription on. (c) Binding of repressor blocks binding of RNA polymerase, turning transcription off. Two additional subsidiary repressor binding sites are not shown.
analogues bind to CAP and lac repressor proteins, respectively, to control their binding affinities.
• Binding of lactose induces a conformational change in the repressor, from a tightly binding form to a weakly binding one. Lac repressor will bind only in the absence of lactose.
• The presence of glucose reduces the concentration of cyclic AMP, causing a conformational change in CAP to a weakly binding form. CAP will bind only in the absence of glucose.
The actual molecule that binds to repressor to reduce its affinity for its site on DNA is allolactose. Allolactose is an isomer of lactose, produced from lactose by β-galactosidase. Alternative lactose analogues also stimulate transcription. One that is useful in the laboratory is isopropylthiogalactoside (IPTG), for two reasons: (1) IPTG enters the cell even if the lacY-encoded transporter is dysfunctional or not expressed; and (2) IPTG is not metabolized; therefore its concentration stays constant during the course of an experiment.
The switch thereby responds to the type of sugar in the medium.
• If both glucose and lactose are present, neither control protein binds (Figure 11.27a). RNA polymerase binds only weakly. Transcription occurs at a low basal level.
• If glucose is not present and lactose is present, the CAP–cAMP complex binds to the promoter. The binding of CAP–cAMP with RNA polymerase is cooperative. Interactions with CAP–cAMP increase the affinity of RNA polymerase, thereby stimulating transcription to approximately 40 times the basal level (see Figure 11.27b).
• If lactose is not present, the repressor binds to the operator site. This blocks RNA polymerase and turns off transcription (see Figure 11.27c).
In summary (− means ‘absence of’; + means ‘presence of’):
                Lactose −       Lactose +
Glucose −       Repression      Robust transcription
Glucose +       Repression      Basal level of transcription
The operator site is between the CAP site and the origin of transcription. As a result, lactose absence ‘trumps’ glucose absence. That is, in the absence of lactose, the binding of repressor stops transcription whether or not glucose is absent.
The effect is to express the proteins of the lac operon only if the medium contains only lactose. The bacteria are saying, in effect, ‘I prefer to grow on glucose. If glucose is there, I don’t want high expression levels of the genes for lactose transport and metabolism, even if lactose is present. Only if lactose is present and glucose is not present, express the genes that transport and cleave lactose.’
The lactose operon switch is an example of a ‘fire-and-forget’ mechanism. Once the mRNA is synthesized, what happens on the DNA does not affect it. However, the mRNA for the protein-coding genes of the lac operon (lacZ, lacY, and lacA) has a half-life of ∼3 minutes. In the absence of continuous or repeated induction, synthesis of the lactose-metabolizing enzymes will cease within minutes. This resets the switch.
Logical diagram of the lac operon
Figure 11.28 represents the lac operon control logic as a network. This diagram is almost equivalent to the table showing the response to presence and absence of glucose (see Exercise 11.7).
[Figure 11.28 diagram: inputs ‘Glucose present’ and ‘Lactose present’ feed an AND node whose output is ‘Transcription?’]
Figure 11.28 Logical diagram of lac operon. Green arrows show positive regulation. The red ‘T’ shows negative regulation. The circle containing AND will pass through a positive signal only if glucose is not present (red ‘T’) and lactose is present (green arrow). For many regulatory circuits, we know many of the inputs to a node but do not know the logic.
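The control logic of the lac operon summarized in the table can be written as a small function. This is a sketch of the qualitative logic only: the function name and the returned labels are illustrative choices, and no binding constants or expression levels are modelled.

```python
def lac_operon_state(glucose_present, lactose_present):
    """Qualitative transcription state of the lac operon for a given medium."""
    repressor_bound = not lactose_present  # repressor binds only without lactose
    cap_bound = not glucose_present  # CAP-cAMP binds only without glucose
    if repressor_bound:
        # Lactose absence 'trumps' glucose absence: the repressor blocks polymerase.
        return "repression"
    if cap_bound:
        return "robust transcription"
    return "basal transcription"


for glucose in (False, True):
    for lactose in (False, True):
        print(glucose, lactose, lac_operon_state(glucose, lactose))
```

A pure Boolean AND of ‘glucose absent’ and ‘lactose present’ would merge the basal and repressed outcomes into a single ‘off’ state. This may be one sense in which a logical diagram is only almost equivalent to the table.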
A recent study of transcription regulation in yeast treated a network containing 3459 genes, corresponding to approximately half of the known proteome of S. cerevisiae. The genes included 142 that encode transcription regulators and 3317 that encode target genes exclusive of transcription regulators. There are 7074 known regulatory interactions among these genes, including effects of regulators on one another and of regulators on non-regulatory targets.
Analysis of the overall network architecture revealed several features.
• The distribution of incoming connections to target genes has a mean value of 2.1 and is distributed exponentially. Most target genes receive direct input from about two transcriptional regulators. The probability that a gene is controlled by k transcription regulators, k = 1, 2, . . . , is proportional to e^(−ak), with a = 0.8.
• The distribution of outgoing connections has a mean value of 49.8 and obeys a power law. The probability that a given transcriptional regulator controls k genes is proportional to k^(−b), with b = 0.6. Power-law behaviour is common in networks and characterizes topologies in which a few nodes – the ‘hubs’ – have many connections and many nodes have few. In regulatory networks, hubs tend to be fairly far upstream, forming important foci of regulation with far-reaching control.
• The average number of intermediate nodes in a minimal path between a transcriptional regulator and a target gene is 4.7. The maximal number of intermediate nodes in a path between two nodes
Figure 11.29 is a cartoon-like sketch of a fragment of such a network indicating, rather loosely, some of its general features. Nodes are divided into transcriptional regulators, shown as circles, and target genes, shown as squares. Target genes are distinguished by having no output connections. There is extensive interregulation among the transcription factors, to a much higher density of interconnections than can intelligibly be shown in this diagram. Think of a seething broth of transcription factors, within the shaded area, sending out signals to target genes. The shaded area indicates only the logical clustering of the transcriptional regulators. There is no suggestion about physical localization; indeed, transcriptional regulators interact with DNA and almost never interact physically with the proteins whose expression they control.
[Figure 11.29 labels: transcriptional regulators (about 50 outgoing connections each); five intermediate nodes; target genes; ultimate receptor of signal.]
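The counts quoted for the yeast study can be cross-checked with a few lines of arithmetic, since every regulatory interaction is one outgoing connection of a regulator and one incoming connection of some gene. The sketch below uses only the numbers given in the text; the function name is hypothetical.

```python
def mean_degrees(n_regulators, n_genes, n_interactions):
    """Mean outgoing connections per regulator and mean incoming per gene."""
    mean_out = n_interactions / n_regulators
    mean_in = n_interactions / n_genes
    return mean_out, mean_in


regulators = 142
targets = 3317
genes = 3459
interactions = 7074

assert regulators + targets == genes  # the two classes account for all genes
mean_out, mean_in = mean_degrees(regulators, genes, interactions)
print(round(mean_out, 1), round(mean_in, 1))
```

The mean number of outgoing connections reproduces the quoted 49.8, and the mean number of incoming connections averaged over all genes comes out at about two.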
Each transcriptional regulator directly influences approximately 50 genes on average, although, as with other ‘small-world’ networks following power-law distributions of connectivities, the distribution is very skewed – some ‘hubs’ have very many output connections, but most nodes have very few. A few of the interregulatory connections between transcription factors are shown in red. In about 10% of cases, two neighbours of the same transcription factor interact with each other. A path from one regulator (filled black circle) to one ultimate receptor (filled black square), through five intermediate nodes, is shown in black. The intermediate nodes are other transcriptional regulators, connected both within the path drawn in black and off this path. Even the transcription factor used as the origin of the path receives input connections. Although it is possible to identify target genes from the absence of outgoing connections, it is more difficult to identify ultimate initiators of signal cascades.
The ultimate receptor is a target gene that receives regulatory input but itself has no output links. This target is expected to receive (on average) a second control input. The black target node receives input via a black arrow, along the selected path, and via a red arrow suggesting the second input. Of course the second input may arrive via a path that shares common nodes with the black path, including other routes from the filled black circle.
The dense forest of additional pathways from which this fragment is extracted is not shown. Some ‘back-of-the-envelope’ calculations indicate: (1) there are ∼3500 nodes, each receiving an average of two input connections; (2) there are ∼140 transcription factors, making an average of 50 output connections; (3) the number of input connections must equal the number of output connections, and indeed 3500 × 2 = 140 × 50 = 7000.
Given the complexity, it is difficult to illustrate larger segments of the network in more detail than the simplified version appearing in Figure 11.29.
Analysis of the structures of regulatory networks is an active current research topic. The motifs described in Box 11.11 are the ‘secondary structures’ of network architectures.
The high ratio of interactions to transcription regulators implies that we cannot expect to associate individual regulatory molecules with single, dedicated activities (as we can, for the most part, with metabolic enzymes). Instead, the activity of the network involves the coordinated activities of many individual regulatory molecules.
Adaptability of the yeast regulatory network
The yeast regulatory network achieves versatility and responsiveness by reconfiguring its activities. This is seen by comparing the changes in the activities of networks controlling yeast gene expression patterns in different physiological regimens of the organism: cell cycle, sporulation, diauxic shift (the change from anaerobic fermentative metabolism to aerobic respiration as O2 levels increase), DNA damage, and stress response. Cell cycling and sporulation involve the unfolding of endogenous gene expression programmes; the others are responses to environmental changes.
Different states are characterized both by similarities and differences in gene expression patterns and by the components of the regulatory network that are active. There is considerable shift in expression of target genes. About a quarter of the target genes are specialized to individual physiological states. Of the total of about 3000 target genes, the expression levels of only about half do not show major changes in the different states. Of the 1906 that show altered expression levels in different states, almost half (803) are specialized to a single physiological state.
In contrast, different states show much more overlap in the usage of transcriptional regulators. For instance, for cell-cycle control, 280 target genes (8%) are differentially regulated by 70 (49%) of the transcription regulators. Clearly, there is a much greater degree of specialization in the target genes. In general, half of the transcription factors are active in at least three of the five physiological regimens. However, contrasting with the high overlap of usage of the transcriptional regulators (the nodes), the overlap of the activities within the network (the connections) is relatively low. Different components of the interaction network organize the different gene expression patterns in different states.
Whereas different physiological states are characterized by substitutions of different sets of synthesized proteins, the regulatory network uses much of the same structure but reconfigures the pattern
of activity. Think of the transcription factors as ‘hardware’ and the connections as reprogrammable
‘software’. The molecules do not change but the interactions do: in different states, many
transcription regulators change most, or a substantial part, of their interactions. In particular, the
set of transcription regulators that forms the hubs of the network – those with many outgoing
connections that form foci of control – is not a constant feature of the system. Some hubs are
common to all states, but others step forward to take control in different physiological regimens.
The result of the reconfiguration of activity is that over half of the regulatory interactions are
unique to the different states.

The effect of the changes in the active interaction patterns is to alter the topological
characteristics of the network in different states. For instance, under panic conditions – DNA
damage and stress – the average number of genes under the control of individual transcriptional
regulators increases; the average minimal path length between regulator and target decreases; and
the clustering becomes less dense (i.e. there is less interregulation among transcription factors).
This can be understood in terms of a need for fast and general mobilization – the equivalent of
broadcasting ‘Go! Go! Go!’ over the radio. Normal circumstances – cell-cycle control, for instance –
allow for a more dignified and precise regulatory state, which permits finer control over the
temporal course of expression patterns. In cell-cycle control and sporulation, there is a much
denser interregulation among transcription factors and longer minimal path lengths between
transcriptional regulators and target genes.

Different physiological states also differ in their usage of the common motifs – fork, scatter, and
‘one-two punch’ (see Box 11.11). Forks are used more in conditions of stress, diauxic shift, and DNA
damage. They are appropriate to the need for quick action. Requirements for a build-up of
intermediates would delay the response. Conversely, the ‘one-two punch’ motif is more common in
cell-cycle control. This is consistent with the need for a signal from one stage to be stabilized
before the cell enters the next stage.

Much of evolution proceeds towards greater specialization. The human eye is a classic example. It is
an intricate and fine-tuned structure, features that were once adduced as evidence against Darwin’s
theory. Many evolutionary pathways show a trade-off between specialized adaptation and
generalized adaptability. Regulatory networks are an exception. Evolution has produced structures
that are both specialized and versatile. The reconfigurability of regulatory networks allows them to
respond robustly to changes in conditions by creating many different structures specialized to the
conditions that elicit them.
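
The edge-counting argument and the path-length measure used in this section can be illustrated on a very small directed graph. The following Python sketch is only an illustration: the network, the node names (TF1, G1, ...), and the helper functions are invented for the purpose and are not taken from the yeast data. It checks that the total number of outgoing connections equals the total number of incoming connections, and finds the minimal path length from a regulator to a target by breadth-first search.

```python
from collections import deque

# Hypothetical toy network: regulator -> genes it controls.
# The names are invented for illustration; they are not yeast genes.
edges = {
    "TF1": ["TF2", "G1", "G2"],
    "TF2": ["G2", "G3"],
    "TF3": ["TF2", "G3", "G4"],
}

def degree_totals(edges):
    """Return (total outgoing connections, total incoming connections)."""
    out_total = sum(len(targets) for targets in edges.values())
    in_count = {}
    for targets in edges.values():
        for t in targets:
            in_count[t] = in_count.get(t, 0) + 1
    return out_total, sum(in_count.values())

def shortest_path_length(edges, source, target):
    """Number of edges on the shortest directed path, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

out_total, in_total = degree_totals(edges)
mean_outputs = out_total / len(edges)  # average outputs per regulator

# The estimate quoted in the text: ~3500 nodes x 2 inputs = ~140 regulators x 50 outputs.
assert 3500 * 2 == 140 * 50 == 7000
```

Every connection leaves one node and enters another, so the two totals must agree whatever the network; the same bookkeeping underlies the 7000-connection estimate. A shorter regulator-to-target path corresponds to the quicker response described for stress conditions.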
● RECOMMENDED READING
• The following papers describe the structure, dynamics, and evolution of cellular signalling and
regulatory networks:
Ideker, T. (2004). A systems approach to discovering signaling and regulatory pathways – or,
how to digest large interaction networks into relevant pieces. Adv. Exp. Med. Biol. 547,
21–30.
Babu, M.M., Luscombe, N.M., Aravind, L., Gerstein, M., & Teichmann, S.A. (2004). Structure
and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14, 283–291.
Luscombe, N.M., Babu, M.M., Yu, H., Snyder, M., Teichmann, S.A., & Gerstein, M.B. (2004).
Genomic analysis of regulatory network dynamics reveals large topological changes. Nature
431, 308–312.
Exercises
Exercise 11.1 In the undirected, unlabelled graph in Box 11.3 on page 344, (a) name two vertices
such that if you add an edge between them at least one vertex has exactly four neighbours. (Note
that two edges may cross without making a new vertex at their point of intersection.) (b) Name
two vertices such that if you add an edge between them to the original graph, the graph becomes
an (unrooted) tree. (c) Name two vertices (neither of them V1) such that if you add an edge
between them to the graph produced in (b), the resulting graph does not remain a tree. (d) Name
two vertices such that if you add an edge between them to the original graph, there is exactly one
path between V1 and V3, with no vertices repeated, and it has length 4. (e) Name two vertices
such that if you add an edge between them to the original graph, there are alternative paths, of
lengths 3 and 4, between V1 and V5, with no vertices repeated. (In determining the length of a
path, you have to count the number of edges in the path. A path of length 2 between V1 and V5
contains one intermediate vertex.)
Exercise 11.2 Of the examples of graphs in Box 11.4, (a) which are directed graphs? (b) Which
are labelled graphs? (c) In each example, what is the set of nodes? (d) In each example, what is
the set of edges?
Exercise 11.3 What information is contained in Figure 11.3(b) that could not be recovered from
the kind of data produced by the experiments shown in Figure 11.13?
Exercise 11.4 For which of the methods for determining interacting proteins (pp. 366ff) (a) must
one of the proteins be purified; (b) must both of the proteins be purified?
Exercise 11.5 The binding site for λ Cro (p. 374) is an approximate palindrome, i.e. the two
strands contain approximately the same sequence in reverse order. (a) On a copy of the binding
site, indicate which six residues best fit the palindrome pattern. (b) A palindromic binding site can
interact with a dimeric protein by presenting surfaces of similar structure to both protein subunits.
How far apart are the six-residue regions identified in part (a)? How do you rationalize that
distance in terms of features of the structure of DNA?
Exercise 11.6 From Figure 11.3, (a) what would be the effect of increased expression of metJ
on the expression of metC? (b) What would be the effect of increased expression of OxyP on
the expression of metJ? (c) What would be the effect of increased expression of OxyP on the
expression of metP?
Exercise 11.7 Redraw Figure 11.28 with the top boxes containing glucose absent and lactose
absent instead of glucose present and lactose present.
Exercise 11.9 In the London Underground: (a) What is the shortest path between Moorgate and
Embankment stations? Note that, considered as a graph, the shortest path between two nodes
is the path with the fewest intervening nodes, not the path that would take the minimal time
or fewest interchanges. (b) What is the shortest cycle containing Kings Cross, Holborn, and
Oxford Circus stations? (c) The clustering coefficient of a node in a graph is defined as follows.
Suppose the node has k neighbours. Then the total possible connections between the neighbours
is k(k − 1)/2. The clustering coefficient is the observed number of connections between the
neighbours divided by this maximum potential number of connections. If the neighbours of a station
are the other stations that can be reached without passing through any intervening stations, what is
the clustering coefficient of the Oxford Circus station? (If necessary, see
http://www.bbc.co.uk/london/travel/downloads/tube_map.html)
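
The definition of the clustering coefficient in part (c) can be tried out on a small example before turning to the Underground map. The sketch below is a minimal illustration in Python. The four-node graph and the function name are hypothetical, and returning 0 for a node with fewer than two neighbours is a convention assumed here, since the ratio is otherwise undefined.

```python
from itertools import combinations

# Hypothetical undirected toy graph (not the Underground map):
# each node maps to the set of its neighbours.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def clustering_coefficient(graph, node):
    """Observed links among the neighbours of `node`, divided by k(k - 1)/2."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no pairs of neighbours to connect
    possible = k * (k - 1) // 2
    observed = sum(
        1 for u, v in combinations(sorted(neighbours), 2) if v in graph[u]
    )
    return observed / possible
```

In this toy graph, node A has three neighbours, of which only B and C are linked to each other, so one of the three possible connections is present.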
Exercise 11.10 In the London Underground, (a) what is the maximum path length between any
two stations? (That is, for which two stations does the shortest trip between them involve the
maximum number of intervening stops?) (b) If the District Line were not active, what stations, if
any, would be inaccessible by underground? (c) If the Jubilee Line were not active, what stations,
if any, would be inaccessible by underground?
Exercise 11.11 On a photocopy of the three common network control motifs (Box 11.11, p. 378),
(a) indicate which nodes are controlled by only one upstream node; (b) indicate which node exerts
control over only one downstream node.
Exercise 11.12 On a photocopy of the simplified fragment of the yeast regulatory network
(Figure 11.29) indicate examples of the network control motifs (a) star and (b) ‘one-two punch’.
(c) Add one arrow to create a scatter motif.
Exercise 11.13 In the overall yeast transcriptional regulatory network, the number of incoming
connections to target genes follows an exponential distribution, i.e. the probability that a gene is
controlled by k transcriptional regulators is proportional to $e^{-ak}$, with a = 0.8, k = 1, 2, . . . . What is
the ratio of the number of target genes receiving four input connections to the number receiving
two input connections?
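
As a consistency check on the distribution stated in Exercise 11.13, one can normalize the weights and compute the mean number of inputs per target gene. The Python sketch below does this under the assumption that truncating the sum at a large k (here 200, an arbitrary choice) is harmless; the function name is invented for illustration.

```python
import math

def input_distribution(a, k_max=200):
    """Normalized probabilities P(k) proportional to exp(-a*k), for k = 1..k_max."""
    weights = [math.exp(-a * k) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = input_distribution(0.8)

# Mean number of regulatory inputs per target gene under this distribution.
mean_inputs = sum(k * p for k, p in enumerate(probs, start=1))

# Closed form for comparison: the weights form a geometric series with ratio
# r = exp(-a), so the mean is 1 / (1 - r).
closed_form = 1.0 / (1.0 - math.exp(-0.8))
```

With a = 0.8 the mean comes out at roughly 1.8 inputs per gene, in line with the earlier estimate of about two input connections per node.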
Problems
Problem 11.1 (a) In one species, only enzyme A catalyses conversion S → P, the rate-limiting step
of a reaction pathway. What is the flux control coefficient of enzyme A? (b) In a related species,
distinct but similar enzymes A and B both catalyse the S → P reaction. The kinetic characteristics
of A and B are identical: S → P is still the rate-limiting step of the pathway. What is the flux control
coefficient of A?
Problem 11.2 For dissociation of a complex involving a simple equilibrium, $AB \rightleftharpoons A + B$,
the equilibrium constant, $K_D = \frac{[A][B]}{[AB]}$, is equal to the ratio of forward and reverse
rate constants: $K_D = k_{\mathrm{off}}/k_{\mathrm{on}}$.
For avidin–biotin, $K_D = 10^{-15}$ M. Suppose $k_{\mathrm{on}}$ were as fast as the diffusion limit,
$\sim 10^{9}$ M$^{-1}$ s$^{-1}$. (a) What is the value of $k_{\mathrm{off}}$? (b) What would be the
half-life of the avidin–biotin complex? (c) Suppose $k_{\mathrm{on}}$ for avidin–biotin were
$10^{7}$ M$^{-1}$ s$^{-1}$. What would be the half-life of the complex?
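
The relation between the dissociation constant, the rate constants, and the half-life can be sketched in a few lines of Python. The example assumes simple first-order dissociation with no rebinding, so that the half-life is ln 2 divided by the off-rate. The function name and the numerical values (a generic complex, deliberately not avidin–biotin) are hypothetical.

```python
import math

def dissociation_half_life(k_d, k_on):
    """Return (k_off in s^-1, half-life in s) for K_D in M and k_on in M^-1 s^-1.

    Assumes first-order dissociation with no rebinding:
    k_off = K_D * k_on, and t_half = ln 2 / k_off.
    """
    k_off = k_d * k_on
    return k_off, math.log(2) / k_off

# Hypothetical complex: K_D = 1 nM, k_on = 10^6 M^-1 s^-1.
k_off, t_half = dissociation_half_life(k_d=1e-9, k_on=1e6)
```

For these illustrative values the off-rate is 10^-3 s^-1 and the half-life is about 690 s, or roughly 12 minutes. A tighter complex with the same on-rate has a proportionally longer half-life.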
Problem 11.3 Write detailed structures for all of the metabolites that appear in Figure 11.3a.
Problem 11.4 The network of metabolic pathways must obey constraints of thermodynamics
and physical-organic chemistry. Meléndez-Hevia and colleagues suggested the principle that
metabolic pathways are optimized, subject to the constraints, for the minimum number of steps.
The non-oxidative phase of the pentose phosphate pathway converts six five-carbon sugars to
five six-carbon sugars:
6 ribulose-5-phosphate → 5 glucose-6-phosphate
A simplified model of a pathway for this conversion is a series of steps, each of which is either:
Represent each sugar only by a number of carbon atoms. Starting with six five-carbon sugars,
one possible initial step would be a transketolase step converting two five-carbon sugars to a
three-carbon sugar and a seven-carbon sugar. Assume that all intermediates must have at least
three carbon atoms.
Create a tableau with the following initial and final states (an initial transketolase (TK) step is
also shown):
0 5 5 5 5 5 5
TK
1 3 7 5 5 5 5
...
N 6 6 6 6 6 0
Copy and fill in the tableau to find the shortest route from the top (step 0, six five-carbon sugars)
to the bottom (five six-carbon sugars). Identify the intermediates created. Compare this with the
observed metabolic pathway.
Problem 11.5 Choose 15 amino acids, by crossing off the ones most similar to others. Devise a
doublet genetic code for these 15 residues, plus a stop signal, that is as close as possible to the
actual triplet code.
Problem 11.6 (a) You create a strain of E. coli in which the order of promoter and operator in
the lac operon is reversed. Will this strain express lacZ, lacY, and lacA in the absence of lactose
and glucose? (b) You create a strain of E. coli by moving the operator from its normal position
to a position between the lacZ and lacY genes. Will this strain express lacZ, lacY, and lacA in the
absence of lactose and glucose? Will this strain express lacZ, lacY, and lacA in the presence of
lactose and absence of glucose? (c) You add exogenous cyclic AMP to wild-type E. coli. Will lacZ,
lacY, and lacA be expressed in the presence of glucose and lactose? (d) You create a strain of
E. coli with a point mutation in lacZ that renders the enzyme completely dysfunctional. Will this
strain express lacY in the presence of lactose? Will this strain express lacY in the presence of
isopropylthiogalactoside (IPTG)?
Problem 11.7 Indicate how to connect a selection of the three common network control motifs
so that a single input node can influence three output nodes.
Problem 11.8 What is the minimum number of ‘yes-or-no’ questions required to identify a specific
letter of the upper-case alphabet: ABC . . . Z?
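
The reasoning behind this kind of question can be explored on a different set of alternatives. The Python sketch below is an illustration with invented function names: it repeatedly halves a list of equally likely candidates, counts the yes-or-no questions used, and compares the worst case with the smallest whole number of binary answers that can distinguish n alternatives. It uses the ten decimal digits rather than the alphabet.

```python
def questions_needed(n_alternatives):
    """Smallest q such that 2**q >= n_alternatives (integer arithmetic only)."""
    return (n_alternatives - 1).bit_length()

def identify(items, secret):
    """Find `secret` by halving the candidate list; return (item, questions asked)."""
    candidates = list(items)
    asked = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        asked += 1  # one question: "is it in the first half?"
        candidates = half if secret in half else candidates[len(half):]
    return candidates[0], asked

digits = "0123456789"
worst_case = max(identify(digits, d)[1] for d in digits)
```

Some digits are found with fewer questions than others under this halving scheme; the quantity of interest is the worst case, which for ten alternatives matches questions_needed(10).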
Weblems
Weblem 11.1 In the methionine biosynthetic pathway (see Figure 11.3), the product, methionine,
inhibits an enzyme in the middle of the pathway, homoserine O-succinyltransferase. (a) The
accumulation of which intermediate might this inhibition be expected to cause? (b) In what
other pathways is this intermediate involved that might use it up?
Weblem 11.2 Figure 11.4 shows the amino acid biosynthesis leading from aspartate to
methionine, threonine, and lysine. On a photocopy of this figure, write in the names of the
omitted intermediates at the unlabelled positions between consecutive arrows.
Weblem 11.3 The enzymes encoded by metC and malY in E. coli convert cystathionine to
L-homocysteine. Each has another function in addition. What are these other functions?
Weblem 11.4 Compare the pathways for biosynthesis of chorismate in E. coli, M. jannaschii, and
Aeropyrum pernix. What is the earliest common intermediate in these pathways? What are its
precursors in the three species?
Weblem 11.5 Define the following terms: (a) interactome; (b) signalome. (c) This is more difficult:
can you think of, and define, a reasonable ‘ome’ that has not yet been proposed?
Weblem 11.6 Compare the methionine biosynthesis pathway from aspartate to methionine in
E. coli and yeast. (a) Are there any differences in the series of intermediates? (b) Are the enzymes
that catalyse similar transformations homologous? Show alignments of the amino acid sequences
where possible. Use EcoCyc for E. coli (http://ecocyc.org) and the Saccharomyces Genome
Database (http://www.yeastgenome.org) for yeast, searching in each case for ‘methionine
biosynthesis’.
Weblem 11.7 Draw Figure 11.27(a) to scale, using the known E. coli genome sequence as a
source of the true sizes of the regions.
Weblem 11.8 Find the page in EcoCyc corresponding to Figure 11.3. Choose different levels of
detail and list what types of information are presented at each level.
Weblem 11.9 An enzyme with a function related to peptidylglycine monooxygenase (EC
1.14.17.3), and linked to it in the ENZYME DB, is 1-aminocyclopropane-1-carboxylate oxidase
(EC 1.14.17.4). (a) What is the lowest common ancestor of the two reactions in the EC
classification? (b) What is the lowest common ancestor of the two reactions in the Gene Ontology
molecular function classification? (c) Are these two enzymes closely related in the Gene Ontology
classification?
Weblem 11.10 What identifiers does Gene Ontology associate with E. coli aspartate
aminotransferase, in the molecular function category? Arrange them in a directed acyclic graph,
indicating the parent–child relationships between these identifiers.
Weblem 11.11 According to EcoCyc, what reactions can orotidine 5′-monophosphate undergo?
What enzymes catalyse these reactions? What genes encode these enzymes?
Weblem 11.12 Figure 11.5 shows the reductive carboxylate cycle, and the EC numbers of the
enzymes that catalyse the individual steps. Find the corresponding information for the tricarboxylic
acid cycle (or Krebs cycle), and for the glyoxylate cycle. Do an alignment of the metabolites
participating in these cycles, and display the EC numbers of the enzymes that correspond to different
reactions in different cycles. Report what is common to pairs or to all three of these pathways, in
terms of (a) metabolites, (b) links between metabolites, corresponding to reactions, (c) enzymes
that catalyse the reactions.
EPILOGUE
The new century has already seen major achievements in genomics. Sequencing of the human genome
is the jewel in the crown. And yet, the field is still in a preparative and anticipatory stage. We can
be confident that:

• Methods of sequence determination will increase in power. There will be an explosion in the
number of complete sequences of different human beings and of many other organisms.
• Tools for analysis will make progress. Better algorithms, making effective use of more data, will
produce more reliable inferences. Modelling of structure and process will improve, towards the
target goal of the simulation of the complete cell in silicio.
• We will achieve a more profound understanding of what life is and how it works.

We will also gain greater control over living systems. Many applications will emerge, in clinical,
agricultural, and technological fields. Some of these are relatively simple extrapolations from what
has already been achieved. Others seem more like the stuff of science fiction: a tightly coupled
silicon–life interface and the in vivo deployment of nanoparticles sensing and interacting with our
biochemical states. Nevertheless, they represent natural developments of the current state of the
art.

Understanding the genome may ultimately release us from its constraints.
INDEX
a1-antitrypsin 22, 43, 58, 331 ancient DNA, see palaeosequencing AT-rich 346, 373
a-helix 303ff. Angelman syndrome 16, 39, 85 – 86, autism 51
b-blocker 23 139 –140 autoantibody 276
b-galactosidase 368, 380 –381 angiogenin 240 autoimmune disease 57
b-sheet 303ff angiosperm 122, 218, 248, 263 autopolyploid 141
l cro 374 annelida 119 autoradiograph 18, 79, 96 –97, 113, 307
antennapedia 374 –375 autoregulatory 363, 377, 379
anthocyanin 15, 280 –281 autosome 147, 224, 258, 285
A anthopleura 49 avoparcin 287–288
anthropoid 138
aardvark 227, 236 antibiotic 11, 16 –17, 23, 108, 127, 129,
acetaldehyde 274 155, 191–192, 196, 205 –207, 212, B
acetylation 44, 305 269, 287–289, 293 –295, 330
acetylcholinesterase 45, 305 antibody 9, 38, 123, 153, 177, 186, 206, baboon 152–153, 324
acetyl-CoA 204, 273, 352, 358 208, 268, 276, 291, 298, 327, 330, backbone (protein) 299ff.
achiral 301 333, 367–368 BacMap 130 –131
acorn worm 119, 220 anticoagulant 77, 187 bacterial artificial chromosome (BAC)
actinobacteria 204 –205 anticodon 14, 111, 293, 347 98ff.
actinopterygii 152 antigen 38, 55, 61, 124, 126, 177, 187, bacteriophage 18, 81, 94, 96, 120, 125,
acylphosphatase 302–303 206, 333 131, 207, 291, 334, 367, 373 –374
adenosylcobalamin 47 anti-inflammatory 271 display 367
adenovirus 125 anti-influenza 128 bacteriorhodopsin 140, 195
adenylate cyclase 371 antimalarial 59 Barcode of Life 117, 154
adrenoleukodystrophy 43, 73, 125 antiporter 204 barley 108, 242
Aepyornis 230 antipsychotic 52 barrel 300, 313, 326
Affymetrix 271, 282 antithrombin 187, 320, 322 base pair 92
agammaglobulinaemia 150 antiviral 23 basophils 291
aggregation 60, 331 Aplysia 108 B-cell 283
aldolase 177–179, 190, 360 –361, Apolipoprotein E, 51, 61– 62 Beadle, G. 244, 262
364 –365, 387 apoptosis 62, 364, 376 Bernal, J.D. 302
alfalfa 141 apyrase 102 bilateria 119
alga 13, 31, 47, 67– 68, 118, 120, 122, Arabidopsis thaliana 8, 13, 38, 98, 108, biodiversity 165, 187
132–133, 156, 217–218 132–133, 136, 141, 215, 218 –220, biofuels 188
alignment 167 247–249, 261 biogeography 260
approximate methods 173 archaea 17, 68, 115, 118, 122, 129 –136, bioinformatics 24ff., 167ff.
dot-plot and 169 192–204, 358 –361, 365 bioluminescence 343
multiple 174 Archaeopteryx 224 biotechnology 17, 25, 27, 106, 187, 298
structure 176 armadillo 149, 152, 227 biotin 224, 333, 368, 387
alkaloid 219, 247–248 aromatase 5 bipolar 77
allergen 187, 291 arousal 274 bisphosphoglycerate 319, 359ff.
allolactose 380 –381 arrhythmia 150 BLAST (Basic Local Alignment Search
allopolyploid 141 arthritis 57, 271 Tool) 173ff.
allosteric 47, 318ff. arthropod 116, 119, 162 BLOSUM matrices 172–173, 175
alveolates 217 artiodactyl 152 bluetongue virus 125
Alzheimer’s disease 51, 58, 61– 62, Ashburner, M 100, 148, 352 Bombyx mori 108
149 –150, 305, 331–332 Ashkenazi Jews, common BRCA gene bottleneck 52, 143, 227, 242, 244, 336
amino acid 6ff., 200, 299ff. mutants 63 – 64, 77 BRCA1, BRCA2 18, 62– 65, 77, 116,
substitution 172 asparaginase 292 220
amphioxus 49, 119, 141 aspartylglucosaminuria 305, 339 breakpoint 87, 271
ampicillin 16 Aspergillus 159, 302 Brenner, S. 50, 147, 281, 312, 384
amplicon 95 assembly (sequence) 18 –19, 94, 98ff., bryophyte 218
amyloidosis 58, 331–332 112–113, 151, 247, 292 bryozoa 119
amyotrophic lateral sclerosis (ALS) 124 asteroid impact 42, 67– 69, 192 BSE (bovine spongiform encephalopathy)
anaemia 4, 58 – 60, 179, 331–332 asthma 140, 266, 271 331–332
anaerobe 196 –197, 201 astrocyte 283 bubonic plague 145, 205, 346
anaerobic 5, 108, 158, 205, 268, 273, ataxia 51, 149 Buddenbrockia 48 – 49
383 ATPase 204, 317, 330 buffalo 20
Index 391
chip 18, 183, 201, 266 –268, 271, 282, coverage 19, 98 –100
C 292 craniata 29
chip-seq 18 C-reactive protein 380
cadherin 283 Chironomus 31 creatine 314 –315, 337–339
Caenorhabdidis elegans 108, 136, 274, chiroptera 152 C-region 326
285, 332 chloroperoxidase 339 Creutzfeldt–Jakob disease 331–332
development 48 chlorophyll 195, 205, 210, 213 Crick, F.H.C. 6, 41, 43, 91–92, 111,
genome 49, 122–123, 140, 147–149, chlorophyte 218 176, 306
151, 153, 220 chloroplast 4, 6, 11–13, 15, 37–38, criollo 247–248, 261
nervous system 50 –52, 151 132–133, 155 –156, 162, 205, 219, Critical Assessment of Structure
caffeine 247, 263, 274 235, 317 Prediction 322
calcium 42, 72 chloroquine 59 crossing-over 83
Cambrian 67– 68, 137, 192 chocolate 245 –250, 261–263 cross-linking 287, 289, 306
cancer 5, 16, 18 –20, 23, 42, 47, 62– 65, cholera 122, 129, 343 cultivar 247, 279 –280
71–72, 87, 108, 116, 149 –150, 206, cholesterol 61, 153, 276 C-value 121
224, 268, 290 –291, 364 chordate 29, 116, 119, 141, 220 –221, cyclic-AMP 380
canidae 228 235 cystathione 356, 388
cantaloupe 132 choreatic 22 cystic fibrosis 17, 90, 125, 150, 153, 266
capillaries 59 – 60, 97, 309 chorismate 204, 360, 388 cytoglobin 26, 30, 137, 156, 235 –236
capsid 124, 330, 333, 373 –374 chromatin 8, 12, 47, 81, 84, 132, 142, cytokine 108, 284
carbamazapine 77 147–148, 154, 284, 353 cytoskeleton 11, 284, 298, 317, 371
carboniferous 68, 218 immunoprecipitation 367–368
carboxykinase 204 remodelling 45
carnivora 152, 227–228 chromogenic 368 D
carotenoid 289 chromophore 140, 210, 306, 362, 368
carrier 17, 60 – 61, 120, 127, 221, 268, chromosome 26ff, 142 database 24, 30, 104ff.
315, 346 chymotrypsin 155 bibliographic 109
cartilaginous 119 cilia 23, 34, 118, 317 BLOCKS 172
CASP (see Critical Assessment of Ciona intestinalis 108, 141, 159, clusters of orthologous groups (COG)
Structure Prediction) 220 –221, 235 –236 130
cassowary 182, 230 circadian rhythm 52, 219, 241, 275, 364 DNA 106, ethical issues 32ff.,
cat (cloned) 46 cis-aconitate 358 barcoding 117
catalase 274 cis-regulator 8, 377 ESTs 108, 279
catarrhini 29, 152 citrulline 305 enzyme structures 352
catfish 108 c-jun 327 expression and proteomics 108ff.
CATH 107, 113, 310 cladistics 184 genetic diseases 106 –107
cDNA 14, 64, 72, 95, 108, 267–268, clotting, see coagulation genome browsers 25
271, 277, 279, 282, 292, 294 cloverleaf (tRNA) 13 genomes (GOLD) 216
Celera Genomics 19 –20, 23, 100, CLUSTAL-W 171, 173, 183, 213 homophila 149
147–148, 295, 307, 309, 337 cnidaria 48 – 49, 119 metabolic pathways 109, 354
cell-adhesion 278 coagulation 43, 143, 179, 291 MITOMAP 251
cell-cycle 47, 244, 284, 292, e363, 376, codon 6, 13 –16, 36 –37, 60, 136, 162, nucleic acid structures 107
383 –384 212, 337, 347–348 protein function classification 351ff.
cellulase 196 usage 197–198 protein interactions 366, 370
cellulose 196, 205, 221, 235 co-enzyme 130 –131, 202, 333 protein sequences 106
cell-wall 289 –290, 357 cofactor 61, 217, 224, 314, 351, 355, protein structures 107, structure
centimorgan 111 359 –360 classification 107, 310, prediction
centromere 8, 26, 85, 147 coiled-coil 326 –327 325
channel 219, 226 coimmunoprecipitation 367 quick screening of 173ff.
chaos 144, 349 –350, 363 –364, 384 co-infect 127 Saccharomyces genome database
chaperone 130 –131, 200, 202, 330 co-inheritance 85 217–218
Chargaff’s rules 91–92 colitis 204 SNP databases 53 –54
Chase, see Hershey–Chase collision-induced ionization 311 Species 2000 165
cheetah 52 complexity 346ff. taxonomy 165
chemiosmotic 12, 317 concanavalin A 306 daunorubicin 292
chemotaxis 312, 315, 377 conformation, protein 300ff dbEST 108, 294
chemotherapy 290 congenital 179 dbSNP 53
CheY 312 conifer 218 defensin 226
chimaera 120, 208 contig 18 –19, 90, 99 –100, 104 dehydroquinate 360
chimpanzee 42, 50, 71, 86, 88, 110, CopyCat 46, 73 Deinococcus radiodurans 122, 205
117ff., 139ff., 149, 152, 156 –157, co-repressor 375 delphinidin 280 –281
227, 250, 268, 285 –287, 294 corn, see maize dementia 332
392 Index
denaturation 9 –10, 200, 304, 307–308, electroreception 225 fern 68, 122, 218
310 electrospray ionization 309 ferredoxin 204, 210, 358, 360 –361
dendrites 44 elephant 119, 152, 230 –231, 233 –235 ferritin 298
deoxyhaemoglobin 166, 319, 321, 331 Elm yellows 42 fibre diffraction 91–92
dephosphorylation 273 EMBL-BANK 106 fibronectin 137–139, 303
depolarization-sensitive 276 emmer wheat 141, 242 fibrosis 150
detoxification 273, 380 emphysema 22, 39, 58, 291, 331 flavodoxin 312
deuterostome 118 –119, 157, 220 enamel 224 flavonoid 246 –247, 281
dexamethasone 271, 293, 371 enantiomer 195 ‘Flavr Savr’ tomato 46
diabetes 34, 43, 57, 61, 364 encephalopathy 331–332 flow cytometry 290
diauxic shift 272–274, 383 –384 ENCODE 151–154 fluorophore 266
dideoxynucleotides 79, 95 –97 modENCODE 153 fluorescence resonance energy transfer
dihydroflavonol 281 endocytosis 132, 283 (FRET) 368
dihydrokaempferol 280 –281 endogluconase 221 foetal haemoglobin 26, 45, 319
dihydromyricetin 280 –281 endonuclease 88, 189, 373 forestero 247–248
dihydroquercetin 281 endopeptidase 306 fork 141, 354, 377–379, 384
dihydroxyphenylalanine 371 endoplasmic reticulum 11 fossil 55, 66 – 67, 71, 76, 118 –119, 182,
dimorphism 231 endoreplication 143 194, 216, 227, 229, 234, 250
dinornithidae 230 –231 endosperm 143, 245 FOXP2 144, 250, 285
dinosaurs 68 – 69 endosymbiont 12, 132, 156, 158 fractal 350
dipeptide 203 –204 enhancer 45 frameshift mutation 178
diptera 116 enolase 180, 361 Franklin, R. 91–92
disease transmission 346 enteritis 205
disulphide 106, 273, 284, 300, 304 –306, enthalpy 200
312 ENTREZ 53, 106, 158 G
exchange 134 entropy 200, 346 –348, 351
dithiothreitol 307 enucleate 143, 291 gag (HIV-1 protein) 124
DNA polymerase 94 –95, 101–102, 192, Enzyme Commission (EC) 313, 351ff. galactose 72, 283, 380, 386
195, 238, 354 eosinophil 291 galactosylceramidase 283
DNA structure 91–93 ephrin 168 –170 Galapagos Islands 183, 259
docking 333 epidemic 122, 126 –127, 145, 159, 207, gamete 82– 83, 121, 132, 140
dodo 42, 119, 232 332, 346 ganglioside 305
dog genome 226ff. epidermal 298 G-banding 84
domain (protein) 137ff., 302ff. epigenetic 3 –5, 16, 39 GC-rich region 346
swapping 335 epilepsy 77 geochemistry 192
Drosophila melanogaster 51, 86, 108, epitope 57, 368 gibberellin 219
116, 143, 332 erythrocruorin 31 Gibraltar 55 –56
development 48, 275 –278, 374 erythrocyte 59 – 60, 291, 320 Giemsa stain 84
genome 99 –100, 122, 136, 147–149, erythropoietin 45 Gilbert, W. 94, 96 –97
285, 352 Escherichia coli glaciation 77, 143, 230
dot plot 168ff. genome 129 gleevec 87
Duchenne’s muscular dystrophy 150 transcriptional regulatory network 378 glia 275 –276, 283
dugong 233 euarchontoglires 29, 227 globin 7, 26 –32, 37–38, 73, 114,
dystonia 51 euchromatin 147 136 –138, 153 –154, 156 –157, 179,
dystrophin 8, 39 eudicot 218, 235, 248 –250, 263 190, 221, 235 –236, 285, 305,
dystrophy 43, 73, 150 euryarchaeota 195 –196 318 –321
euteleostomi 29 allosteric change in haemoglobin
eutheria 29, 157, 227 318ff., 342
E exocytosis 283 haemoglobin 26 –32, 37–38, 44 – 45,
exome 18, 20 47, 59 – 60, 136 –137, 166 –167, 291,
Embden–Meyerhof pathway 204, exon 8ff., 14, 20, 133 298, 302–303
272–273, 358 –359 ExPASy 109 myoglobin 26, 29 –32, 38, 57,
echidna 225 extinction 41– 42, 66 – 69, 71, 118, 192, 136 –137, 319
echinoderm 119, 220 252, 259 glucocorticoid receptor 271, 371
EcoCyc 341, 355 –356, 388 extremophile 205 glucokinase 360
EcoRI 88 – 89, 110 eyeless 277 gluconeogenesis 178, 213, 273
EcoRV 373 glume 243 –245
ectopic 48, 374 glutamatergic 276
effector 137, 319 –320 F glutathione 58, 274 –275
einkorn 141, 242 glyceraldehyde 178, 204, 359 –361
elastase 22, 155 farro 141 glycerol 156, 194 –195, 246
electrophoresis 64 – 65, 72, 88, 96 –97, fava bean 59 glycogen 350
207, 297, 307–308 feed-forward loop 377, 379 glycolipid 305
Index 393
glycolysis 178 –179, 204, 213, 273 –274, histone 7, 11–13, 44 – 45, 129, 154, 194, isologous interaction 335
289 –290, 357–359 251, 373 –374 isopropylthiogalactoside 381, 387
glycophorin 145 HIV 36, 73, 77, 124 –126, 305 isozyme 178 –179, 243
glycoprotein 9, 45, 125 –127, 305, 339 homeobox 148
glycosylation 106, 306 homeodomain 374 –375
golden rice 245 homeostasis 274 –275, 377 J
Gondwanaland 182 homeotic gene 279, 374
G protein 138, 140, 316, 362 homodimer 335 jaundice 59
gramicidin 330 homogametic 77, 224 Jeffreys, A. 238 –239
granulocyte 291 homology 136, 161, 167, 181, 197, 201,
granulysin 145 283, 342, 370
green fluorescent protein 50 modelling 323 –328 K
GroEL 330 homoplasmy 240
GTP 95, 204, 316, 318, 362 homozygote 82, 112 Kabat, E.A. 153
honeybee 49, 68 kallikrein 283
heterochromatin 8, 84, 142, 147–148 Kanehisa, M. 109
H Huntingdon’s disease 22, 89, 284 kangaroo 71, 138, 164, 225, 227
huntingtin 22, 331 keratin 48, 224, 298, 327, 330
haemagglutinin 126 –127 Huxley, H.E. 80 kinase 180, 203 –204, 248, 273, 276,
haematopoiesis 290 –291 hydrocortisone 271 283, 289, 298, 302, 315, 335,
haemocyanin 221 hydrogen bond 91–92, 176, 179, 355 –357, 362, 371, 373
haemoglobin (see globin) 299 –304, 330, 333ff., 373 arginine 314 –316, 337–339
haemoglobinopathy 59, 331 hydrophobicity 134, 156, 179, 200, database 107
haemolytic anaemia 59, 179 299 –300, 325 –327, 331–333, phosphofructokinase 330, 359
haemophilia 43, 240 336 –337 shikimate 360 –361
Haemophilus influenzae 122, 136, 205 hyperpolarization 276 tyrosine 87, 292, 370
haemorrhagic fever 204 hypersensitive 45 korarchaeota 195 –196
hagfish 119 hyperthermophile 196 –198, 200 –202, Krebs cycle 272–274, 290, 354, 356, 388
hairpin 13, 113, 177 205, 212
half-life 130, 381, 387 hypervariable region 239, 262
halophile 118, 195 –196, 361 hypoglycemia 178 L
Hamming distance 171 hypomethylation 16
haplogroup 251–253, 257, 263 hypoxanthine-guanine lactalbumin 324
haploid 53 –54, 83, 98, 121, 143, 216, phosphoribosyltransferase 52 lactamase 16
251 hypoxia 45 lactone 343, 357
haplotype 20, 52–57, 72–73, 75 –77, lactose 4 –5, 42– 44, 47, 72, 241, 283,
79, 85, 89, 110 –111, 229, 237, 380 –381, 385 –387
252–253, 259 I lagomorph 152, 227
HapMap 20 –21, 53 –55, 106, 110, 116, lamprey 119, 221
151, 159, 250 immunocompromised 126 lancelet 108, 119, 220
Havasupai 34 immunodeficiency 125 Last Universal Common Ancestor
hedgehog 152 immunoglobulin 221, 266, 298, 331 (LUCA) 195, 212
Helicobacter pylori 122, 136, 205 –206 immunohistochemistry 268 laurasiatheria 227
hemichordate 119, 220 immunology 57, 117, 119, 126, 206, 224 L-dopa 370 –371
hepatitis 124, 206, 213 immunoprecipitation 64, 367–368 lectin 283, 305 –306
hepatocyte 22 immunosuppression 57 leghaemoglobin 31
herbicide 187 indolepyruvate 204 lemur 152, 159
hermaphrodite 50, 147 influenza 122, 125 –128, 136, 157, lepidoptera 163
Hershey–Chase experiment 81, 93, 205, 305 leprosy 149
124 –125 insecta 116 leptospirosis 205
heterochromatin 8, 84, 142, 147–148 insecticide 187 Lesch–Nyhan syndrome 52, 113
heterodimer 177 insulin 8, 43, 58, 61, 94, 106, 186, 283, leucine-rich repeat 248
heteroduplex 65 306, 364 leucine zipper 327, 374 –375
heterogametic 77, 224 integrase 124, 201 leucocyte 55
heteroplasmy 240 interactome 388 leukaemia 22–23, 87, 110, 143, 150,
heterozygote 54, 58, 65, 72, 82– 83, intercalation 376 268, 290 –292
112 interleukin 154, 248, 263 Levenshtein distance 171
hexanucleotide 15, 348 intron 8ff., 14, 27ff., 124, 129 ligase 275, 351–352
hexokinase 335 iodoacetamide 307 lignin 173, 243, 292
hidden Markov model (HMM) ionization 309 Linnaeus, C. 66, 80, 115 –117, 164, 345,
175 –176, 325 –326 ionotropic 283 353
hippocampus 282–283 isoelectric focusing 308 lipid 130 –131, 194 –195, 202, 232, 275,
hirudin 333 isoform 16, 149, 159 305, 367
RFLP (see restriction fragment length sediment 192 solute 105, 200, 204, 268, 299, 373
polymorphism) segregation 81– 82, 140 solvent 298 –300, 308 –309, 328 –329
rheumatoid arthritis 57 selection (in evolution) 54, 58, 80, Sonnhammer, E. 326
rhizaria 217 83 – 84, 123, 134, 136, 140, soybean 42, 108, 248 –249
rhodopsin 48, 140, 195, 362 145 –146, 163, 166, 176, 179, species
riboflavin 204 186 –187, 198, 206, 233, 242–245, difficulty of definition 42, 66, 117,
ribonucleoprotein 47 280, 285, 289 –290, 328ff., 348 120, 164
ribosome 6 –7, 10, 13, 47, 133, 176, selenocysteine 7 endangered, extinct 17, 42, 67ff., 118,
192–194, 304, 330, 335, 342, selenomethionine 304, 306 165, 216, 229ff., 240, 250
352, 374 sepharose 232 sperm 4, 31–32, 122, 143, 159, 218,
rice 108, 133, 187, 242–243, 247 sequence-tagged site 98 245, 248, 251, 263
golden 245 sequencing 18ff. (see also spermidine 159
rickettsia 13, 38, 122, 158, 205, 380 palaeosequencing) spindle 285
RNA 6, 11, 13, 17, 46, 94, 106 –107, decreasing cost 4, 18, 98 spinocerebellar ataxia type 3, 149
118, 121, 123, 125, 126, 129 –131, exome 20 splicing 8 –9, 14, 16, 43, 47, 60, 106,
143, 154, 162, 169, 177, 195, genome sequencing projects 17 124, 162, 178, 219, 242, 304, 307
197–198, 209, 217, 229, (see also individual species) squid 119, 343
268 –269, 271, 282, 342, 352, high-throughput techniques 98ff., src-homology domain 370
371, 374 232ff. staphyloxanthin 289
messenger (mRNA) 6 –7, 9, 14 –15, history of techniques 18ff., 90ff. starfish 119, 220
30, 43 – 44, 46 – 47, 64, 95, serotonin 52 steady state 343, 364 –366
108 –109, 124, 129 –130, 136, serotype 126, 157 stellacyanin 313
176, 198, 266 –268, 277, 285, serpin 58, 314, 320 stem cell 291
377, 380 –381 seven-helical transmembrane protein streptavidin 333, 368
mRNA editing 9, 106, 123, 304 140 streptomycin 155
miRNA and siRNA 7, 16, 47, 154 shikimate dehydrogenase 360 –361 stress 44, 47, 52, 150, 272–274, 276,
ribosomal 7, 117–119, 132, 166, shotgun sequencing 19, 79, 98 –100, 104, 343, 383 –384
192–194, 196, 208, 217, 232, 113, 147, 151, 227, 292–293 oxidative 59, 143, 210, 272–274, 284,
251 sialic acid 127 370, 377
transfer (tRNA) 8, 10, 13 –14, 47, sialidosis 339 stromatolite 194
132, 147–148, 162, 192, 198, sickle-cell anaemia 4, 58 – 60, 331–332 subgraph 345, 377, 380
201–202, 217, 251, 274, 306, sidechain 48, 59, 134, 178, 194, 198, substitution 48, 60, 136, 144 –146, 162,
352, 374 200, 299ff., 314, 320, 323 –324, 171–173, 175, 177–178, 185, 241,
RNA interference 47, 280 –281 326, 330, 333 –334, 375 –376 251, 308, 328 –330, 363 –364, 383
RNA polymerase 16, 44, 376 –377, signal 4 – 6, 8, 14, 44 – 45, 87, 96 –97, (see also single-nucleotide
380 –381 101, 102, 104, 133, 138 –141, 162, polymorphism)
RNAseq 18, 292–293 170, 202, 217, 220, 244, 273, subtilisin 155, 329 –330
ROBETTA 326 –327 276 –278, 280, 283 –284, 287, 291, sulfolobus 159, 195 –196, 358
298, 305, 312, 316, 342ff., 352, sulphurylase 101–102
356, 362ff., 370ff. supercoil 129, 373
S peptides 326 –327, 332 superfamily 107, 138, 283, 298,
transcriptional control 18, 48 – 49, 86, 310, 312
Saccharomyces cerevisiae 44, 122, 136, 361, 382ff. superoxide dismutase 274
149, 155, 193, 213, 217, 266, signalome 388 superposition (of protein structures or
272–273, 367, 382–383 signal-transduction 220, 312, 345 substructures) 32, 176 –177, 313,
salamander 122 silkworm 108 315, 321, 324
Sali, A. 324 single-end read 18, 37 suppressor 62– 63, 65, 116, 247, 376
Sanger, F. 18, 94ff. single-input motif 379 surface plasmon resonance 367
sarcomere 317–318 single-nucleotide polymorphism 18, 20, sweetpea 81
Sargasso Sea 208 –209 52–55, 60 – 63, 84, 144, 151, SWISS-MODEL 324 –325, 327
Sasisekharan–Ramakrishnan– 177–178, 228 –229, 268 SWISS-PROT 29, 106, 354
Ramachandran plot 301–302 single-stranded synaptogyrin 283
satellite 7– 8, 88, 243, 346 conformational polymorphism test 65 synechocystis 13, 38, 122
scale-free network 346, 377 DNA 94, 97, 125 synonymous mutation 6, 9, 57, 136,
schizophrenia 34, 51–52, 77 RNA 14, 17, 125, 198 162, 198
Schizosaccharomyces pombe 217 sleep 23, 272, 274 –276, 332 synteny 86, 141, 143, 145 –146, 153,
scrapie 332 smallpox 22, 42, 125 162, 201, 206, 221–224, 229,
SDS-PAGE 307–308, 337 SNP (see single-nucleotide 248 –250, 286
secondary structure 178, 192–194, 292, polymorphism) syntrophin 337, 371
301–304, 312–313, 320ff., 325 –327, Solexa 120 synuclein 283, 331
330, 374, 383 SOLiD (Applied Biosystems sequencer)
prediction 325 –327 102–104 systematics 71, 116, 164 –165, 231, 234
secretory 373 solubility 299 systems biology 341ff.
signal 48, 130 –131, 140 –141, 202, 201–202, 206, 208, 211, 243,
T 217, 220, 278, 280, 284, 287, 312, 305 –306, 330, 333, 368, 374
316, 345, 352, 361–362, 370 –371 φX-174 18, 122
Takifugu rubripes 122, 224 transfection 329 HIV-1 57, 124 –126, 305
tandem affinity purification 368 –369 transgenic 51, 145 influenza 122, 125ff., 229, 305
tarsier 159 transhydrogenase 365 vitamin 47, 58 –59, 72, 179, 187, 245,
Tasmanian devil 42, 68 – 69, 119 transition (mutation) 171, 343 278, 345, 357
TATA box 14, 375 transition-state 128, 314 vitellogenin 225
TATA box-binding protein 373 –376 translation 6 –10, 13 –14, 29 –30, 43 – 47,
tau 21, 108, 116 –117, 141, 305, 315, 64, 105 –106, 108, 120, 123, 125,
331, 343 129 –133, 154, 162, 175 –176, 187, W
tauopathies 331 197, 201–202, 276 –277, 284, 348,
taurocyamine 315 374 (see also post-translational wakefulness 247, 272, 274 –276
taxonomy 30, 48, 65 – 66, 117–118, modification) warbler 183
164 –167, 204, 230, 232, 259, 290, translocation 13, 16, 28, 52, 86 – 87, Watson, J.D. 20, 23, 91–93, 111
345, 353 217, 271 whale 31–32, 68, 116, 227, 240
T-cell 57, 221, 290 transmembrane 51, 125, 138, 140, 153, wheat 108, 122, 141, 189, 235,
T-coffee 171, 213 326 –327, 362, 370 242–243
telomere 8, 145 transposition 15 –17, 87, 136, 162 wobble hypothesis 111
teosinte 241–245, 262 transposon 15 –17, 129, 206, 219, 242 Woburn Abbey 70
termination (chain) 8, 60, 95 –96, 177 transthyretin 331 wwPDB 107, 110, 310, 312–313, 325,
tertiary structure 192, 302–304, 312, transversion 171 328, 352
320 –321, 325 trEMBL 106, 354
tetracycline 289 triticale 141
Tetraodon nigroviridis 221–223 triticosecale 141 X
thalassaemia 45, 60, 285 Triticum 108, 141
Theobroma cacao 248ff. trypanosome 8, 217 xanthine 52, 290
theobromine 246 –247, 263 trypsin 22, 43, 58, 120, 155, 308, 331, 333 xenarthra 152, 227
thermitase 329 –330 tryptic digest 308, 310 –311 Xenopus 108, 138, 143, 168, 170, 189
Thermococcus kodakarensis 201ff. tuberculosis 120, 122, 192, 369 xeroderma pigmentosum 150
thermolabile 200 tumour 5, 20, 42, 108, 150, 283, X-linked 43, 52, 58, 73, 86
thermophile 118, 122, 195 –198, 200 290 –291, 316 X-ray tomography 367
thermostability 198, 329 –330 tumour-suppressor 62– 63, 65, 116, 376
thiamin 47 tunicate 119, 221
thioredoxin 134 –135 turkey 103 –104, 112, 232, 246 Y
thrombin 187, 320, 322, 333 two-hybrid screening 367
thylacine 68 – 69, 119 typhoid 205, 346 YAC (yeast artificial chromosome) 11,
thymus 57 typhus 13, 122, 205 96 –97, 178, 204, 297, 307–308,
tiling 266 338, 359 –360
time scale 68 y-chromosome 241
tinamou 182 U yeast 5, 11, 25, 108, 118 –120, 144,
tissue-culture 269 149 –150, 206, 216, 306 –307, 332,
tobacco 108, 125, 368 ubiquitin 8, 39, 44, 305, 333, 373 335, 363, 367, 380
tomato 46, 108, 187 ultraviolet 208, 210 gene expression control 266, 272ff.,
torpor 274 uncoil 147, 374 377, 382ff.
toxin 97, 204, 206, 226 UniProt 25, 29, 106, 213, 236, 352 genome 14, 44, 86, 122–124, 132,
transcriptase, reverse 14 –15, 124 –125, upregulation 371 140, 147, 216ff.
292 urea 34, 236, 307, 359 protein interaction network 370 –371
transcription 6 –9, 11, 14 –16, 18, 26, uric acid 52, 61, 91, 205, 236, 359 Yersinia pestis 205
43ff., 60, 64, 108 –109, 120, urochordata 119, 141, 220 –221, 235 y-ion 311
123 –124, 129 –131, 141, 144, 148, Y-rich region 45
154, 162, 169, 176 –177, 197, 202,
217, 219 –220, 225, 244, 266, 268, V
271ff., 282–285, 287, 289 –292, Z
304, 316, 327, 330, 342–343, 351, vaccine 21–22, 125, 204, 206, 208, 212–213
355 –356, 362–363, 365, 367–368, vacuole 367 zanamivir 127–128, 156
371, 373 –386 vancomycin 206, 287–290, 294 Z-antitrypsin 22, 331
transcription activator 368 varroa mite 68 Z-disks 317
transcriptome 18, 154, 266, 292–294 venom 225 –226 zebra 49, 51, 108, 119, 138, 141, 152
transduction vesicle 232, 283, 371 zebrafish 119, 138, 152
DNA transfer 131 vincristine 292 zinc finger 271, 375
energy 12, 149, 210, 317, 326 virus 9 –11, 15, 17, 22, 42– 44, 47, 67, zipper 327, 374 –375
photo 278 81, 121, 124 –125, 128, 163, 192, Zuckerkandl, E. 118