
Bioinformatics: A Practical Guide to NCBI Databases and Sequence

Alignments

Book · January 2022

DOI: 10.1201/9781003226611


1 author:

Hamid Ismail
Michigan Technological University

All content following this page was uploaded by Hamid Ismail on 18 January 2022.


Bioinformatics
Chapman & Hall/CRC
Computational Biology Series

About the Series:

This series aims to capture new developments in computational biology, as well as high-quality work summar-
izing or contributing to more established topics. Publishing a broad range of reference works, textbooks, and
handbooks, the series is designed to appeal to students, researchers, and professionals in all areas of computa-
tional biology, including genomics, proteomics, and cancer computational biology, as well as interdisciplinary
researchers involved in associated fields, such as bioinformatics and systems biology.

Metabolomics: Practical Guide to Design and Analysis

Ron Wehrens, Reza Salek

An Introduction to Systems Biology: Design Principles of Biological Circuits

2nd Edition
Uri Alon

Computational Biology: A Statistical Mechanics Perspective

Second Edition
Ralf Blossey

Stochastic Modelling for Systems Biology

Third Edition
Darren J. Wilkinson

Computational Genomics with R

Altuna Akalin, Bora Uyar, Vedran Franke, Jonathan Ronen

An Introduction to Computational Systems Biology: Systems-level Modelling of Cellular Networks

Karthik Raman

Virus Bioinformatics
Dmitrij Frishman, Manuela Marz

Multivariate Data Integration Using R: Methods and Applications with the mixOmics Package
Kim-Anh LeCao, Zoe Marie Welham

Bioinformatics: A Practical Guide to NCBI Databases and Sequence Alignments

Hamid D. Ismail

For more information about this series please visit:

https://www.routledge.com/Chapman–HallCRC-Computational-Biology-Series/book-series/CRCCBS

Bioinformatics
A Practical Guide to NCBI Databases
and Sequence Alignments

Hamid D. Ismail
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2022 Hamid D. Ismail
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been
acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under US Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@
tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation
without intent to infringe.
Library of Congress Cataloging‑in‑Publication Data
Names: Hamid, Ismail D., author. | National Center for Biotechnology Information (U.S.)
Title: Bioinformatics : a practical guide to NCBI databases and sequence alignments / Ismail D. Hamid.
Description: First edition. | Boca Raton : CRC Press, 2022. |
Series: Chapman & Hall/CRC computational biology series |
Includes bibliographical references and index.
Identifiers: LCCN 2021036517 | ISBN 9781032123691 (hardback) |
ISBN 9781032128740 (paperback) | ISBN 9781003226611 (ebook)
Subjects: LCSH: Bioinformatics.
Classification: LCC QH324.2 .H35 2022 | DDC 570.285–dc23
LC record available at https://lccn.loc.gov/2021036517
ISBN: 978-1-032-12369-1 (hbk)
ISBN: 978-1-032-12874-0 (pbk)
ISBN: 978-1-003-22661-1 (ebk)
DOI: 10.1201/9781003226611
Typeset in Minion
by Newgen Publishing UK
Access the companion website: http://hamiddi.com/
Contents

Preface
Acknowledgments
Author bio

Chapter 1 | The Origin of Genomic Information

Chapter 2 | The Sources of Genomic Data

Chapter 3 | The NCBI Entrez Databases

Chapter 4 | NCBI Entrez E-Utilities and Applications

Chapter 5 | The Entrez Direct

Chapter 6 | R and Python Packages for the NCBI E-Utilities

Chapter 7 | Pairwise Sequence Alignment

Chapter 8 | Basic Local Alignment Search Tool

INDEX

Preface

It has been a while since I began thinking of writing a bioinformatics book that introduces both theory
and practice. I have been practicing bioinformatics for a decade and, during this time, I have read dozens of
bioinformatics books of high quality in both writing and content. However, as a reader, I, like many of my
colleagues and students, struggled to find a comprehensive and self-contained book that combines theory and
application. Bioinformatics is a field of applied science in which theory and practice should coexist side by
side. An ideal bioinformatics book should therefore provide a good introduction, be clearly written with
informative illustrations, and offer sufficient depth and breadth to serve as a valuable reference for students
and researchers. Thus, from the beginning I set a very clear goal: to write a book that blends both components
so that it can serve as a quick reference for biologists.
The National Center for Biotechnology Information (NCBI) is a resource for molecular biology informa-
tion worldwide. It sets its mission to develop new information technologies to aid in the understanding of fun-
damental molecular and genetic processes that control health and diseases. To achieve this mission, the NCBI
has developed a variety of automated systems for storing and analyzing knowledge about molecular biology,
biochemistry, and genetics. The NCBI’s databases are among the most comprehensive resources of biological
data; they integrate most of the major databases worldwide, and they are among the biological databases
most commonly used by biologists around the globe.
Biological databases play a key role in the bioinformatics that modern biology relies on. They offer scientists
the opportunity to access a wide variety of experimentally driven biological data and relevant information that
can be used in genomic research and knowledge discovery in a wide range of applications in bioscience and
biomedicine.
This book provides the basics of bioinformatics and in-depth coverage of NCBI databases, sequence
alignment, and the NCBI Basic Local Alignment Search Tool (BLAST). As bioinformatics has become essential
for the life sciences, the book has been written specifically to address the needs of a large audience, including
undergraduates, graduates, researchers, healthcare professionals, and bioinformatics professors who need to
use the NCBI databases, retrieve data from them, and use BLAST for tasks such as finding evolutionarily
related sequences, annotating sequences, constructing phylogenetic trees, and finding the conserved domains
of a protein. Technical details of alignment algorithms are explained with minimal use of mathematical formulas
and with graphical illustrations. The uniqueness of this book is that it provides readers with the most widely
used knowledge of bioinformatics databases and alignments, including both theory and application,
via illustrations and worked examples. The book also discusses the use of Windows Command Prompt, Linux
shell, R, and Python for both Entrez databases and BLAST.
The most important aspects of bioinformatics included in this book, and in-depth and up-to-date coverage
of the NCBI databases and BLAST, make this book an ideal textbook for all bioinformatics courses taken by
students of life sciences and for researchers wishing to develop their knowledge of bioinformatics to facilitate
their own research. The chapters are organized to be progressive from basic to more complex but they are also
linked and self-contained so that readers will find all the information they need to understand the content.
Chapter 1 discusses the basics of bioinformatics, including eukaryotic, prokaryotic, and viral genomes; the
composition and structure of DNA, RNA, genes, and proteins; and the processes involved in biomolecule
formation and structure, such as transcription, translation, and mutation. The purpose of this introductory
chapter is to serve as a foundation for bioinformatics in general, but it also aims to provide readers with
the knowledge to understand the origin of the genomic data that are the major elements of the biological
databases.
Chapter 2 continues to set out the fundamental background knowledge needed to understand how genomic data
are generated from biological samples and which file formats are used in bioinformatics to store sequence
data in the genomics databases. It discusses the laboratory techniques that generate the experimentally
derived genomic data, which are the basic elements of genomics and genomics databases. The techniques
discussed include polymerase chain reaction (PCR), first-generation sequencing, and next-generation
sequencing; the file formats include FASTA, FASTQ, SAM/BAM, VCF, and AGP.
Chapter 3 covers the NCBI’s Entrez databases in detail. The discussion includes the use of the database,
database indexed fields, construction of search queries, results pages, the content of record pages, and linked
tools. This chapter aims to be practical, and readers can practice while reading to get the full benefit. The use
of each database is illustrated with screenshots of the results pages and individual record pages accompanied
with discussion of the content. This chapter also serves as an introduction to Chapters 4, 5, and 6.
Chapter 4 discusses the Entrez programming utilities (E-utils), which are application programming
interfaces (APIs) to the NCBI Entrez databases that enable access to the databases from computer programming
languages. This chapter aims to provide readers with basic E-utilities programs so that programmers
can use them and regular users can understand how other programs apply them.
Chapter 5 covers Entrez Direct (EDirect), which is a set of command-line programs that wrap the
E-utils functions. Readers will learn how to install and use EDirect in Windows, Linux, and macOS
environments. The chapter also discusses how to use Linux bash shell commands and scripting to
extend the use of EDirect functions for data retrieval and extraction.
Chapter 6 discusses the use of R packages (R Entrez) and Python (BioPython) for searching and retrieving
data from the NCBI databases. Both R and Python are free and open source, and they are among the most widely
used scripting languages in the classroom and in the life sciences. Accessing Entrez databases from inside
these two programming environments helps in developing desktop and web applications and in advanced
research work that requires integration of different resources.
Chapter 7 discusses the basics and algorithms of the global and local alignments. It walks the reader through
the steps of the dynamic programming for the sequence alignment. It also illustrates the algorithms of the
regular BLAST and PSI-BLAST and demonstrates the steps of computing the position-specific scoring matrix
(PSSM), which is used for motif discovery and detection of conserved regions in protein sequences. The aim of
this chapter is to provide readers with the background knowledge on sequence scoring schemes, substitution
matrices, and the sequence alignment algorithms.
Chapter 8 covers the uses of BLAST programs. In the first part, the chapter discusses the uses of web-based
BLAST with worked examples of the most frequent scenarios, such as finding related sequences, constructing
a phylogenetic tree, finding conserved domains and structures in proteins, and annotating sequences. The
second part of this chapter focuses on the stand-alone BLAST, which is essential for users who run searches
exceeding the NCBI allowed limit. The chapter illustrates how to install and use the stand-alone BLAST and
how to create a custom BLAST database.
Acknowledgments

I would like to thank Dr Amani Babekir for her help in reviewing the chapters of this book several times
and providing useful comments that added to the quality of the content. I would also like to thank Amna
Ismail, who helped me in drawing the figures of this book and designing the cover image. My thanks also
extend to both Dr Cory Brouwer and Dr Roshonda Jones for reviewing the book proposal and for their
encouraging feedback.


Author bio

Hamid D. Ismail earned his Ph.D. and M.Sc. in computational science from North Carolina Agricultural and
Technical State University, USA, and Doctor of Veterinary Medicine (DVM) and Applied Diploma in Statistics
and Programming from the University of Khartoum, Sudan. He has worked as an adjunct professor, bioinfor-
matics specialist, bioinformatics tool developer, senior scientist, statistician, and database consultant with sev-
eral institutions. He is currently a Ph.D. Research Associate at North Carolina Agricultural and Technical State
University. Dr. Ismail is a published author in bioinformatics, statistics, machine learning, and bio-science,
and he has developed a number of bioinformatics desktop and web-based applications. He is also a reviewer
for a number of academic journals.

CHAPTER 1

The Origin of Genomic Information

INTRODUCTION
The diversity of life, from simple organisms like bacteria to the largest animals, and the diversity of
individuals within a species, are guided by biomolecules inside living cells called deoxyribonucleic acid (DNA).
The DNA molecule is formed of only four basic monomeric units known as DNA nucleotides, each composed of a
phosphate group, a sugar, and one of four types of nucleobases, or simply bases (adenine, cytosine, guanine,
and thymine). In bioinformatics, those four units are given the letters A, C, G, and T, respectively. The DNA
molecules in a living cell are represented as sequences of those four nucleotides, forming the genome. Viruses
usually have small genomes; Bacteriophage spp has a median total length of 8689 bases (8.689 kb). The smallest
non-viral genome is that of a bacterium known as Carsonella ruddii, which has a genome of 164,376 bases
(164.376 kb). The total length of the human genome is 3,272,090,000 bases (3,272.09 Mb).
Segments of DNA known as genes control the different aspects of the life of an organism by instructing the
cells to synthesize proteins, which do most of the work in cells and are required for the structure, function,
and regulation of the body tissues and organs. The instructions are transcribed into ribonucleic acid (RNA),
which is translated into a specific protein. The two-step process (transcription and translation) by which the
information in a gene flows into proteins is known as the central dogma of molecular biology. The information
in the DNA is also transmitted from one generation to another: the new generation of a living organism inherits
its characteristics through DNA transmitted from its parents. The diversity of life is attributed to the ability
of DNA to change slowly over generations, with some changes producing traits that are better adapted to changes
in nature. Such changes, or mutations, contribute to the diversity of life.
Advances in molecular biology and biotechnology have made it possible to capture the information carried by
DNA, RNA, and proteins. Sequences and other biological information from diverse species and from individuals
within a species are now increasingly deposited by researchers and institutions in bioinformatics databases
to be available for retrieval and analysis for research purposes. Genomic information has revolutionized
biology and made modern biologists dependent on bioinformatics, which uses computer science to store, organize,
search, manipulate, and retrieve the genomic information. Institutions like the National Institutes of Health
(NIH), the European Molecular Biology Laboratory (EMBL), and the National Institute of Genetics (NIG) of Japan
have contributed greatly to the progress made in bioinformatics. Together, those three institutes formed the
International Nucleotide Sequence Database Collaboration (INSDC) [1], which is a joint effort to collect and
disseminate databases containing DNA, RNA, and protein sequences. The INSDC includes GenBank (USA), the
European Nucleotide Archive (UK), and the DNA Data Bank of Japan (Japan). Those three partners capture,
preserve, share, and exchange a comprehensive collection of nucleotide sequences and associated information on
a daily basis. The INSDC policy allows public access to the global archives of nucleotide data generated in
publicly funded experiments. Submission of genomic data is encouraged by the fact that it is a prerequisite for
publication in scholarly journals. The database records are publicly available for scientists from all over the
world to access and analyze, so that they can draw conclusions and publish their findings.
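The unit conversions quoted above for genome lengths (bases to kb and Mb) can be checked with a short sketch. The Python snippet below is only an illustration; the names UNITS and to_unit are hypothetical, and the lengths are the figures quoted in the text, not values retrieved from a database.

```python
# Genome lengths in bases, as quoted in the text (not fetched from NCBI).
GENOME_LENGTHS = {
    "Bacteriophage spp (median)": 8_689,
    "Carsonella ruddii": 164_376,
    "Human": 3_272_090_000,
}

# Number of bases in each unit (hypothetical helper table).
UNITS = {"kb": 1_000, "Mb": 1_000_000}


def to_unit(bases: int, unit: str) -> float:
    """Convert a length in bases to kilobases (kb) or megabases (Mb)."""
    return bases / UNITS[unit]


for name, length in GENOME_LENGTHS.items():
    # Small genomes are reported in kb, large ones in Mb, as in the text.
    unit = "Mb" if length >= UNITS["Mb"] else "kb"
    print(f"{name}: {length:,} bases = {to_unit(length, unit):,.3f} {unit}")
```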
DOI: 10.1201/9781003226611-1

Before digging deeper, it is important to discuss some basics of genomics that will help readers to
understand bioinformatics. The foundation of bioinformatics is built on the data that represent the flow of
genomic information from DNA to RNA to proteins. Therefore, understanding the composition of these three
kinds of biomolecules, gene structure, gene transcription and expression, mutation, and the techniques used to
obtain such genomic data is fundamental for understanding the biological databases and other bioinformatics
applications.

GENETIC INFORMATION AND ITS TRANSMISSION

In the traditional Linnaean system of classification, living organisms are classified on the basis of cellular
organization and methods of nutrition into five kingdoms: Monera (bacteria), Protista (protozoans and
algae), Fungi (funguses), Plantae (plants), and Animalia (animals). A modern taxonomic classification has
been made to extend the Linnaean system to consider genomic characteristics. Nowadays, biologists
recognize only two vastly different cell types, prokaryotic and eukaryotic, based on the absence or presence of a
membrane-bound nucleus containing the genetic material of the cell. Therefore, a living organism is either
prokaryotic or eukaryotic [2, 3]. Prokaryotes are unicellular organisms that do not have a true nucleus
or membrane-bound organelles (Figure 1.1a). They include bacteria, which are the most abundant
organisms, and archaea, many of which inhabit the most extreme environments on the planet, such as
high-temperature, alkaline, or acidic waters. Eukaryotes comprise all other living organisms (animals, plants,
fungi, and protists), whose cells possess a clearly defined nucleus and organelles (Figure 1.1b).
The genetic information of any organism is carried by DNA, which is the hereditary material in living
organisms. Nearly every cell of an organism has the same DNA molecules. Nuclear DNA molecules are
known as chromosomes [4]. A prokaryotic cell may possess a single circular chromosome, two circular
chromosomes, one circular and one linear chromosome, or one linear chromosome, free-floating in the cell
cytoplasm. Circular chromosomes have no free ends. Some bacteria have small DNA molecules outside the chromatin
body known as plasmids, which are independent of the chromosomal DNA. In eukaryotic cells, the DNA
molecules or chromosomes are found inside a defined nucleus. Each somatic cell has the same number of
chromosomes. In animal cells, a small amount of DNA is also found in the mitochondria (mitochondrial
DNA or mtDNA). In plant cells, a small amount of DNA is also found in the chloroplasts. Generally, each organism
has a specific number of chromosomes (Table 1.1). The set of all DNA molecules in a single somatic cell
of an organism (including mitochondrial DNA in animals, chloroplast DNA in plants, and plasmids in
bacteria) is called the genome. The prokaryotic genome is haploid, that is, formed of a single set of unpaired
chromosomes (Figure 1.2a), while the chromosomes of eukaryotic cells are linear and diploid, that is, they exist
in pairs (Figure 1.2b); each parent (father and mother) contributes one chromosome to each pair, so
that offspring get half of their chromosomes from their mother and half from their father.
In complex eukaryotes, each organism has a specific number of chromosomes, of which one pair is the sex
chromosomes (X and Y) and the remaining are non-sex chromosomes known as autosomes. For example,
a human somatic cell has 23 pairs of chromosomes, of which 22 pairs are autosomes and one pair is the
sex chromosomes (X and Y). Each eukaryotic chromosome is made up of DNA tightly coiled many times
around histones, and it has a constriction point called the centromere, which divides the chromosome into two
sections called arms. By convention, the shorter arm of the chromosome is known as the p-arm and the longer
arm is known as the q-arm. The location of the centromere on each chromosome gives
the chromosome its characteristic shape and can be used to describe the location of specific genes. The ends of
chromosomes are protected by telomeres.

FIGURE 1.1 (a) Bacterial (prokaryotic) cell and (b) eukaryotic cell.

TABLE 1.1 Number of Chromosomes in the Somatic Cells of Some Organisms

Organism      Number of Chromosome Pairs

Human         23
Chimpanzee    24
Cow           30
Sheep         27
Goat          30
Horse         32
Dog           39
Cat           19
Rice          12
Bean          11
Tobacco       24

FIGURE 1.2 (a) A bacterial chromosome and (b) a eukaryotic chromosome.
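The relationship between chromosome pairs and the total number of chromosomes in a somatic cell can be made concrete with a minimal sketch. The values below are copied from Table 1.1; the names CHROMOSOME_PAIRS and somatic_chromosome_count are hypothetical and serve only as an illustration of the arithmetic.

```python
# Chromosome pairs per somatic cell, copied from Table 1.1.
CHROMOSOME_PAIRS = {
    "Human": 23, "Chimpanzee": 24, "Cow": 30, "Sheep": 27,
    "Goat": 30, "Horse": 32, "Dog": 39, "Cat": 19,
    "Rice": 12, "Bean": 11, "Tobacco": 24,
}


def somatic_chromosome_count(organism: str) -> int:
    """Total chromosomes in a diploid somatic cell: two per pair."""
    return 2 * CHROMOSOME_PAIRS[organism]


# In the human, one of the 23 pairs is the sex chromosomes (X and Y),
# so the remaining pairs are autosomes.
human_autosome_pairs = CHROMOSOME_PAIRS["Human"] - 1

print(somatic_chromosome_count("Human"))  # 46
print(human_autosome_pairs)               # 22
```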
The DNA molecules in chromosomes wrap around complexes of histone proteins giving the chromosomes
compact compressed shapes so that the DNA molecules can fit in the nucleus. Without such packaging, DNA
molecules would be too long to fit inside cells.
There are different imaging techniques that can be used in studying the structure of chromosomes [5].
These techniques include light microscopy, fluorescence microscopy, electron microscopy, and coherent X-ray
diffraction imaging. Staining is usually used to enhance the contrast between cellular components allowing
the proper visualization of chromosomes with different imaging techniques. Differential staining along the
length of a chromosome leads to the production of clearly visible bands (cytogenetic bands), which provide
information about the chromosomes. Each chromosome has its own unique banding pattern. Therefore, the bands
can be used to identify individual chromosomes, to specify the cytogenetic location of a gene, and to study
abnormalities in the chromosome due to deletions, insertions, or translocations.
The cytogenetic location consists of a chromosome number (in the human, 1-22, X, or Y), the arm of the
chromosome (p or q), and the position of the gene on the p or q arm. The position is usually designated by two
digits representing a region and a band. The two digits are sometimes followed by a decimal point and
one or more additional digits representing sub-bands within a light or dark area. For example, the cytogenetic
location of the BRCA1 gene in the human somatic cell is 17q21.31, which represents the address of the BRCA1 gene
as region 2, band 1, sub-band 31 on the long arm of chromosome 17. Chromosomes are represented graphically
by chromosome ideograms, which are used to show the relative size of the chromosomes of an organism and
their characteristic banding patterns generated by staining (Figure 1.3).

FIGURE 1.3 Human chromosome 11 showing banding patterns.
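The structure of a cytogenetic location can be illustrated by splitting a location string into its parts. The following Python sketch assumes only the simplified human notation described above (chromosome, arm, one region digit, one band digit, and optional sub-band digits after a decimal point); it does not handle ranges or other extended notations, and the names CytogeneticLocation and parse_location are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CytogeneticLocation(NamedTuple):
    chromosome: str          # "1" to "22", "X", or "Y"
    arm: str                 # "p" (short arm) or "q" (long arm)
    region: int              # first digit after the arm
    band: int                # second digit after the arm
    sub_band: Optional[str]  # digits after the decimal point, if any


# Simplified pattern for human locations such as "17q21.31" (an assumption,
# not a full implementation of cytogenetic nomenclature).
LOCATION_PATTERN = re.compile(
    r"(?P<chromosome>1[0-9]|2[0-2]|[1-9]|X|Y)"
    r"(?P<arm>[pq])"
    r"(?P<region>\d)(?P<band>\d)"
    r"(?:\.(?P<sub_band>\d+))?"
)


def parse_location(text: str) -> CytogeneticLocation:
    """Split a location string into chromosome, arm, region, band, sub-band."""
    match = LOCATION_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"Not a recognized cytogenetic location: {text!r}")
    return CytogeneticLocation(
        chromosome=match["chromosome"],
        arm=match["arm"],
        region=int(match["region"]),
        band=int(match["band"]),
        sub_band=match["sub_band"],
    )


# BRCA1: long arm of chromosome 17, region 2, band 1, sub-band 31.
print(parse_location("17q21.31"))
```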
In most complex organisms, like the human, one chromosome of each pair is inherited from the mother
(ovum) and the other from the father (sperm). The offspring inherit some of their traits from their mother
and others from their father. The pattern of inheritance is different for the circular mitochondrial DNA. Only
ova keep their mitochondria during fertilization. Therefore, mitochondrial DNA is inherited from the
female parent.
In complex animals like the human, the ova in the female and the sperm in the male are formed in specialized
organs (ovaries in females and testes in males) by a type of cell division called meiosis, which produces four
different daughter cells from a single cell (Figure 1.4) [6]. Each daughter cell is haploid; that is, it contains
only one copy of each chromosome pair. Generally, all male and female gametes, which are the products of
meiotic cell division, are haploid.
Fertilization occurs when the nucleus of a male gamete (sperm) combines with the nucleus of a female
gamete (ovum). In flowering plants, the male gamete is a cell in the pollen grain and the female gamete is
an egg cell in the ovule. When the male and female gametes combine, the resulting cell is called a zygote
(Figure 1.5) [7, 8].
Embryonic development, also known as embryogenesis, refers to the steps of development and formation
of the embryo from a fertilized ovum to a mature embryo (Figure 1.6). Embryogenesis is characterized by
the processes of mitotic cell division (after fertilization) and cellular differentiation of the embryo that occur
during the early stages of development. Mitosis is a type of cell division that results in two daughter cells for
tissue growth. Each daughter cell will have the same number and kind of diploid chromosomes as the parent
cell [9].

FIGURE 1.4 Cell division forming male sperm (top) and female egg (bottom).

FIGURE 1.5 Fertilization in plant and animal.

At fertilization, the genetic material of the haploid sperm and the haploid ovum combine to form a
single diploid cell called a zygote, and the germinal stage of development begins. In the human, embryonic
development covers the first eight weeks after fertilization. From the ninth week, the embryo is referred to
as a fetus. The gestation or pregnancy period for humans is around 40 weeks (nine months).
In the case of bacterial cells, reproduction takes place using binary fission, which is basically an asexual cell
division where a single parent bacterial cell replicates its circular DNA and plasmids before splitting into two
new identical daughter cells (Figure 1.7) [10].

FIGURE 1.6 Embryonic and fetal development.

FIGURE 1.7 Bacterial cell division.

STRUCTURE OF DNA AND GENOME

In the previous section, we discussed how the genetic information of an organism is stored in DNA molecules or
chromosomes and transmitted from parents to offspring sexually in eukaryotes and asexually in prokaryotes.
In this section, we will discuss the chemical structure of the genetic information and how it is transcribed and
translated into functions that drive and characterize the life of an organism.

Deoxyribonucleic acid or DNA was first observed by the Swiss biochemist Friedrich Miescher in the late
1800s. The structure of DNA was resolved by James Watson and Francis Crick in 1953 [11]. DNA is made
up of four types of building blocks called nucleotides. Each nucleotide contains a phosphate group, a sugar
group, and a nitrogenous base. The phosphate group and the sugar group, called deoxyribose, are the same for
all four nucleotides (Figure 1.8). However, each nucleotide has a different nitrogenous base. The four
nitrogenous bases that distinguish the four nucleotides are adenine (A), thymine (T), guanine (G), and cytosine (C).
DNA consists of two strands that wind around each other like a twisted ladder. Each strand has a backbone
made of alternating deoxyribose sugars and phosphate groups joined by phosphodiester bonds.
The phosphodiester bond is the linkage between the 3′ carbon atom of the deoxyribose of one
nucleotide and the 5′ carbon atom of the deoxyribose of another nucleotide (Figure 1.9).
The four nitrogenous bases give the four nucleotides their characteristics. Adenine (A) and guanine (G) are
purine bases, which are structures composed of a five-membered ring fused to a six-membered ring (Figure 1.10).
Cytosine (C) and thymine (T) are pyrimidines, which are structures composed of a single six-membered ring
(Figure 1.11).
In DNA, each of the nitrogenous bases (A, C, G, and T) is attached by a glycosidic bond between one of its
ring nitrogens and the 1′ carbon of the deoxyribose, and the nucleotides are linked to form a single DNA strand.
The two DNA strands are held together by hydrogen bonds between the nitrogenous bases, where adenine (A)
always binds to thymine (T), while cytosine (C) and guanine (G) always bind to one another, forming the
double helix structure of the DNA (Figure 1.12). This relationship between the two bases of a pair (A and T,
or C and G) is known as complementary base pairing.

FIGURE 1.8 (a) Phosphate group and (b) deoxyribose sugar.

FIGURE 1.9 Formation of phosphodiester bond between deoxyribose sugars.

FIGURE 1.10 Purines adenine (left) and guanine (right).

FIGURE 1.11 Pyrimidines cytosine (left) and thymine (right).

FIGURE 1.12 Double helix of DNA.

The order of the nitrogenous bases in the genome of a specific organism determines DNA instructions, or
genetic codes, for building and maintaining the organism. The determined order of the four bases that make
up the genome or a segment of the genome is called a DNA sequence.
The nucleotides in a DNA sequence are arranged in the direction from the 5-prime (5′) end, which has a free
hydroxyl (or phosphate) on a 5′ carbon, to the 3-prime (3′) end, which has a free hydroxyl (or phosphate) on a 3′
carbon. The DNA strand 5′ → 3′ is known as the plus “+” DNA strand and its complementary strand is known
as the minus “-” DNA strand.
In bioinformatics, the nucleotide sequence of a piece of DNA is represented using the one-letter abbreviations
of the four bases A, T, C, and G to identify the order of the nucleotides in the DNA molecule or a fragment of
the DNA molecule. The DNA representation is usually for the plus DNA strand (from the 5′ end to the 3′ end). The
complementary strand of a DNA sequence can be easily predicted from the other strand. For example, assume
that the following is a representation of a DNA sequence:

CTCACTGA

Each letter represents a nucleotide base (A for adenine, C for cytosine, G for guanine, and T for thymine). As
A will bind to T, and C will bind to G and vice versa, the complementary strand for the above DNA sequence
will be:

GAGTGACT

The double-stranded DNA can be represented as

5′-CTCACTGA-3′
3′-GAGTGACT-5′

Such representation of DNA sequence is the basis for the genomic data that the field of bioinformatics
depends on.
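The pairing rule above is easy to express in code. The following is a minimal Python sketch, not part of the original text; the function names `complement` and `reverse_complement` are illustrative choices. It derives the complementary strand of the example sequence, both base by base (as it sits under the plus strand) and rewritten in the conventional 5` to 3` direction.

```python
# Watson-Crick pairing rules from the text: A pairs with T, C pairs with G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Return the base-by-base complement (reads 3'->5' under the input strand)."""
    return seq.upper().translate(PAIRS)

def reverse_complement(seq: str) -> str:
    """Return the complementary strand written in the usual 5'->3' direction."""
    return complement(seq)[::-1]

plus = "CTCACTGA"
print(complement(plus))          # GAGTGACT
print(reverse_complement(plus))  # TCAGTGAG
```

Sequence databases usually store only the plus strand, so the reverse complement is the form most tools expect when the minus strand is needed.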
Not all genomic DNA sequence is functional; only a small portion of it contains functional units called
genes. A gene is the basic physical and functional unit of heredity in the genome. Genes vary in length from
a few hundred DNA bases to millions of bases. Some genes carry the instructions to make proteins, which do
most of the work in cells and are required for the structure, function, and regulation of the body's tissues and
organs. However, many genes do not code for proteins but code for other biomolecules such as transfer
RNA (tRNA), ribosomal RNA (rRNA), and microRNA, which are components of the translation machinery and
of gene regulation. Scientists use sequence information to identify the genes and other sequence units
of the genome and to determine their functions and their implications in traits and diseases. The Human Genome
Project (HGP) estimated that humans have between 20,000 and 25,000 genes [12, 13]. Most genes are the same in
all individuals of the same species, but a small number of genes differ slightly among individuals due to
mutations.
In complex organisms, chromosomes come in pairs: one chromosome of each pair is inherited from the mother and
the other from the father. An individual therefore has two copies (versions) of each gene, one from the mother
and one from the father. These two copies of the same gene are called alleles. The two alleles of a specific
gene may be identical, with the same sequence (homozygous), or different (heterozygous).
The difference is usually due to mutation (substitution, deletion, or insertion), which contributes to variation
among individuals of the same species. A genotype is defined as all the genes passed on to an individual
by its parents. However, not all the genes passed to the offspring are expressed as visible traits. The set of
physical characteristics an individual has is called a phenotype.
An allele is either dominant, which shows its phenotypic effect, or recessive, which does not show its effect in
the presence of a dominant allele. The dominant allele masks the effect of a recessive allele and is denoted by a
capital letter (A), versus a small letter (a) for the recessive allele. In complex eukaryotes, since each parent
provides one allele, the possible combinations are (AA), (Aa), and (aa). Offspring whose genotype is either (AA) or
(Aa) will have the dominant trait (A) expressed as a phenotype, while individuals with (aa) will express
the recessive trait. However, inheritance can also be codominant, a form of inheritance in which both
alleles of a gene pair in a heterozygote are fully expressed. As a result, the phenotype of the offspring is a
combination of the phenotypes of the parents. The AB blood type is an example of codominance in humans. A trait
or phenotype may also be influenced by multiple genes. Such a trait is called a polygenic trait.
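As an illustration of the allele combinations described above, the following Python sketch enumerates the offspring genotypes of a cross. It assumes simple complete dominance for a single gene (no codominance or polygenic effects), and the helper names `cross` and `phenotype` are hypothetical.

```python
from collections import Counter
from itertools import product

def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for two parental genotypes such as 'Aa'."""
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sorting puts the uppercase (dominant) allele first, so 'aA' becomes 'Aa'.
        offspring["".join(sorted(allele1 + allele2))] += 1
    return offspring

def phenotype(genotype: str) -> str:
    """Under complete dominance, one dominant allele is enough to show the trait."""
    return "dominant" if any(allele.isupper() for allele in genotype) else "recessive"

for genotype, count in cross("Aa", "Aa").items():
    print(genotype, count, phenotype(genotype))
```

For two heterozygous parents this gives the three combinations named in the text, (AA), (Aa), and (aa), in a 1:2:1 ratio.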
Genes are given unique gene names and gene symbols following the guidelines published by the Gene
Nomenclature Committee, which is the organization that sets the standards for gene nomenclature [14, 15].
For example, a gene on human chromosome 17 that has been associated with the suppression of breast tumors
is known as the breast cancer type 1 susceptibility gene, and its gene symbol is BRCA1.
The HGP completed the sequencing of the human genome for the first time in April 2003. The sequencing
was followed by genome annotation, which is the process of identifying the locations of genes and the

TABLE 1.2 Genomes of Some Organisms (genome size in Mb = millions of base pairs)

Organism                   Genome Size (Mb)   % Protein-Coding   Number of Protein-   Number of
                                              Sequence           Coding Genes         Proteins
Homo sapiens               2893.5             1.2%               19,593               121,425
Drosophila melanogaster    137.577            10%                13,719               30,717
Caenorhabditis elegans     102.915            25%                19,832               28,350
Saccharomyces cerevisiae   12.1571            70%                5,983                5,409
Escherichia coli           5.12122            88%                4,240                4,741
Haemophilus influenzae     1.8477             90%                1,713                1,716
Arabidopsis thaliana       1.86401            25%                27,442               27,334

protein-coding regions in a genome and determining what those genes do. The genome sequences of many
organisms have been determined following the sequencing of the human genome. Their sequences provide
interesting comparisons to that of the human genome and are proving useful in facilitating studies of
different model organisms and in identifying a variety of different types of functional sequences, including
regulatory elements that control gene expression. The genome sequences of humans and chimpanzees are
about 99% identical. The genome of Neandertals, our closest evolutionary relative, has also been recently
sequenced. It is estimated that Neandertals and modern humans diverged about 300,000–400,000 years ago
[16]. The genomes of Neandertals and modern humans are more than 99.9% identical, significantly more
closely related to each other than to chimpanzees. The number of genes in eukaryotic genomes is not simply
related to either genome size or biological complexity. Although the genome of the small plant Arabidopsis thaliana
is only about 5% the size of the human genome, it contains approximately 26,000 protein-coding genes,
compared with around 20,000 in the human genome. The discrepancies between eukaryotic genome size
and the number of protein-coding genes arise because the genomes of most eukaryotic cells contain not only
protein-coding sequences but also large amounts of DNA that does not code for proteins, called non-coding
sequence.
Table 1.2 shows the genome sizes, sizes of protein coding regions, and the number of proteins for some
organisms. The data was obtained from the NCBI Genome database (last updated: January 27, 2020).
Generally, the genome of any organism can include two types of sequences: coding sequences or genes and
non-coding sequences.
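One way to see that gene number does not scale with genome size is to compute gene density. The short Python sketch below uses three rows copied from Table 1.2; the dictionary layout and the function name `genes_per_mb` are illustrative assumptions.

```python
# (genome size in Mb, number of protein-coding genes), copied from Table 1.2.
GENOMES = {
    "Homo sapiens": (2893.5, 19593),
    "Saccharomyces cerevisiae": (12.1571, 5983),
    "Escherichia coli": (5.12122, 4240),
}

def genes_per_mb(size_mb: float, n_genes: int) -> float:
    """Protein-coding genes per million base pairs of genome."""
    return n_genes / size_mb

for organism, (size_mb, n_genes) in GENOMES.items():
    print(f"{organism}: {genes_per_mb(size_mb, n_genes):.1f} genes per Mb")
```

The compact bacterial and yeast genomes carry hundreds of genes per Mb, while the human genome carries fewer than ten, which reflects its large non-coding fraction.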

Gene Structure
Genes are genome sequences that contain genetic information, or codes, that can be transcribed into
RNA (tRNA, rRNA, microRNA, and mRNA). RNAs are functional biomolecules that are expressed
to carry out specific biological functions. Although the term gene is frequently used only for those DNA
sequences that are transcribed into messenger RNA (mRNA) and translated into proteins, it can be used in a
broader context to include any DNA sequence that contains code for performing biological functions in the
cell. Thus, a gene can be defined as a genomic DNA sequence that represents a functional unit, and it can be
either protein-coding or non-protein-coding. Protein-coding genes are transcribed into mRNA
molecules that carry the coding sequences, or transcripts, for protein synthesis. Non-protein-coding genes are
transcribed into the other types of RNA.
Since the gene is a functional unit, it is different from non-genic DNA sequences. The gene structure
determines when and where the gene will be expressed and the amount of gene expression and the products.
The DNA sequence of a functional gene is divided into certain regions. The gene sequences of eukaryotic
organisms have three types of regions: the regulatory region, introns (non-coding regions), and exons (coding
regions) (Figure 1.13) [17, 18].
The gene structure of prokaryotic cells (bacteria and archaea) is simpler than that of eukaryotic cells as they
lack introns.

FIGURE 1.13 Gene structure.

Gene Regulatory Region

The regulatory region of a gene regulates the activation and suppression of the gene. Although the somatic cells
of an organism have the same genome and the same set of genes, each cell type (e.g., hepatocytes, neurons,
and white blood cells) may have its own gene expression pattern (some genes are on and some are off) to
maintain their functions. Some genes are required in almost all cell types; therefore, they are expressed in all
cells to perform essential biological activities such as DNA and protein synthesis, glycolysis, etc. These genes
are called housekeeping genes and they require a regulatory network or machinery that keeps them on in
almost every cell. Other kinds of genes are tissue-specific, and their expression is confined to cells of a specific
tissue. Those genes are the ones that distinguish the cells of a tissue from others.
Genes are usually activated by special proteins called transcription factors (TFs), which bind to the
regulatory region of a gene to activate its transcription into RNA. There
are also proteins called suppressors that can bind to the regulatory region of a gene to suppress
its transcription and turn the gene off. Gene expression is usually regulated through complex pathways involving
numerous factors.
The regulatory region of a gene is the DNA sequence of the gene where TFs, RNA polymerase, and
suppressors can bind and interact to control the gene expression. The gene expression is regulated by acti-
vation or suppression of the gene transcription depending on the cell needs. In gene activation, a TF usually
binds to the regulatory region of the gene. The TF binding then allows RNA polymerase to bind to the regula-
tory region to transcribe the gene into RNA. The process of RNA transcription is called gene expression. The
regulatory region of the eukaryotic gene consists of enhancer, core promoter, and proximal control elements.

Enhancer In some eukaryotic genes, the enhancer region enhances the transcription of the gene by assisting
the core promoter in binding to a TF. The enhancer may be located upstream of a gene, within the coding region
of the gene, downstream of a gene, or thousands of nucleotides away. The enhancer sequence is a
binding site for transcription factors. The enhancing process is initiated when a DNA-bending protein binds
to the enhancer sequence of the gene and bends it. This bending allows the activators
(bound to the enhancer), the transcription factors (bound to the promoter region), and the RNA polymerase
to interact (Figure 1.14) [19].

Core Promoter The core promoter (Figure 1.15) is the region of the gene where the general TFs and RNA poly-
merase II (RNA Pol II) are assembled for the initiation of transcription of RNA. The core promoter sequence is
about 25–40 base pairs upstream of the transcription start site (TSS) of the coding region. The core promoter
may contain the following sequence elements, which act as recognition sites for the binding of the TFs.

(1) TATA box is a DNA sequence that indicates where a genic sequence can be read and decoded. The
sequence of TATA box region is rich in thymine and adenine (T/A-rich sequence) with the consensus
sequence, which is the most frequent nucleotide sequence, as 5`-TATAWAAR-3` (see Table 1.3 for

FIGURE 1.14 Enhancer.

FIGURE 1.15 Promoter.

TABLE 1.3 IUPAC Degenerate Nucleotide Symbols [20]

Symbol Description Nucleotides

R Purine A or G
Y Pyrimidine C or T
W Weak A or T
S Strong C or G
M Amino A or C
K Keto G or T
H Not G A, C, or T
B Not A C, G, or T
V Not T A, C, or G
D Not C A, G, or T
N Any A, C, G, or T

the nucleotide symbols). The TATA box is usually located about 25-35 base pairs upstream of the
TSSs (Figure 1.15). In bacteria, this region is called the Pribnow box, which has a shorter consensus
sequence.
(2) Initiator (Inr) element is located from -2 to +4 overlapping the TSS. For example, the consensus
sequence of Inr is 5`-YYANWYY-3` in humans and 5`-TCAKT in Drosophila (see Table 1.3 for the
nucleotide symbols). The Inr works by enhancing binding affinity and strengthening the promoter.
(3) TFIIB recognition element (BRE) is located at -37 to -32 and its consensus sequence is
5`-SSRCGCC-3` (see Table 1.3). It can have positive or negative effects on transcription in a promoter
context-dependent manner.

(4) Motif ten element (MTE) is a conserved element with consensus sequence 5`-CSARCSSAACGS-3`
(see Table 1.3). It promotes gene transcription by RNA polymerase II.
(5) Downstream promoter element (DPE) is located at +28 to +32. The DPE functions cooperatively with
the Inr for the binding of TFIID in the transcription of core promoters in the absence of a TATA box.
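The consensus sequences listed above use the degenerate symbols of Table 1.3, which map naturally onto regular-expression character classes. The following Python sketch is one possible way to scan a sequence for such a motif; the function names and the test sequence are hypothetical, and real promoter prediction involves much more than a consensus match.

```python
import re

# IUPAC degenerate symbols from Table 1.3, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "H": "[ACT]", "B": "[CGT]",
    "V": "[ACG]", "D": "[AGT]", "N": "[ACGT]",
}

def motif_to_regex(motif: str) -> str:
    """Translate a degenerate motif such as TATAWAAR into a regex pattern."""
    return "".join(IUPAC[symbol] for symbol in motif.upper())

def find_motif(motif: str, seq: str) -> list:
    """Return the 0-based start of every match, including overlapping ones."""
    pattern = re.compile(f"(?={motif_to_regex(motif)})")  # lookahead allows overlaps
    return [m.start() for m in pattern.finditer(seq.upper())]

# Hypothetical sequence containing one TATA box (TATAAAAG).
print(find_motif("TATAWAAR", "GGCGTATAAAAGGCTC"))  # [4]
```

The same helper works for the other elements, for example `find_motif("SSRCGCC", seq)` for the BRE.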

Proximal Control Elements The promoter proximal control element (PPE) is within around 100 nucleotides
upstream from the core promoter. It stimulates transcription of the gene by interacting with TFs. The number,
identity, and location of the proximal elements vary from gene to gene. The PPE may contain a CAAT box,
which is a distinct pattern with the consensus sequence GGCCAATCT that occurs 60–100 bases upstream of
the TSS.

Introns
Introns are non-coding sections of eukaryotic genes; prokaryotic genes have no introns. Introns are
transcribed into the pre-mRNA, but they are removed before the mRNA is translated into a protein. Introns can range
in size from tens to thousands of base pairs and can be found in a wide variety of genes that generate RNA in
most living organisms, including viruses. It is vital that introns be removed precisely, as any left-over intron
nucleotides, or deleted exon nucleotides, may result in a faulty protein being produced. This is because the
amino acids that make up proteins are joined together based on codons, which consist of three nucleotides.
Imprecise intron removal may thus result in a frame shift, and the genetic code would be read incorrectly.
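The effect of imprecise intron removal on the reading frame can be shown with a few lines of Python. This is a toy sketch: the exon and intron sequences are invented for illustration, and `codons` is a hypothetical helper that simply cuts a sequence into triplets.

```python
def codons(seq: str) -> list:
    """Split a sequence into consecutive three-nucleotide codons (drops leftovers)."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# Hypothetical toy sequences, not taken from a real gene.
exon1, intron, exon2 = "ATGGCT", "GTAAGT", "TGGTAA"

precise = exon1 + exon2                 # intron removed exactly
imprecise = exon1 + intron[-1] + exon2  # one intron nucleotide left behind

print(codons(precise))    # ['ATG', 'GCT', 'TGG', 'TAA']
print(codons(imprecise))  # ['ATG', 'GCT', 'TTG', 'GTA']
```

A single leftover nucleotide changes every codon downstream of the splice junction, and in this toy case the stop codon TAA is no longer read in frame.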

Exons
Exons are coding sections of a gene. In eukaryotic genes, after the removal of introns in the transcribed RNA,
exons are spliced together to form gene transcripts, which include the codons that code for the amino acids
of the proteins. The genetic code is the set of rules by which information encoded in RNA is translated into
proteins (Table 1.4). The gene transcript is composed of tri-nucleotide units called codons, each coding for a
single amino acid. The protein-coding gene transcript contains an open reading frame (ORF). The ORF is a
continuous stretch of codons that begins with a start codon (usually ATG) and ends at a stop codon (usually
TAA, TAG, or TGA) and the codons that are translated into amino acids are between the start codon and stop
codon. In eukaryotic genes, following transcription, immature strands of messenger RNA, called pre-mRNA,

TABLE 1.4 Genetic Code for the Amino Acids Forming Proteins [21]

                          Second Position
First Position    T       C       A       G       Third Position
T                 Phe     Ser     Tyr     Cys     T
                  Phe     Ser     Tyr     Cys     C
                  Leu     Ser     Stop    Stop    A
                  Leu     Ser     Stop    Trp     G
C                 Leu     Pro     His     Arg     T
                  Leu     Pro     His     Arg     C
                  Leu     Pro     Gln     Arg     A
                  Leu     Pro     Gln     Arg     G
A                 Ile     Thr     Asn     Ser     T
                  Ile     Thr     Asn     Ser     C
                  Ile     Thr     Lys     Arg     A
                  Met     Thr     Lys     Arg     G
G                 Val     Ala     Asp     Gly     T
                  Val     Ala     Asp     Gly     C
                  Val     Ala     Glu     Gly     A
                  Val     Ala     Glu     Gly     G

may contain both introns and exons. These pre-mRNA molecules go through a modification process in the
nucleus called splicing, during which the non-coding introns are cut out and only the coding exons remain.
Splicing produces a mature messenger RNA or the transcript that is then translated into a protein.
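The genetic code of Table 1.4 and the definition of an ORF can be combined in a small translation routine. The Python sketch below is a simplified illustration under several assumptions: it uses the standard code written with DNA letters, takes the first ATG as the start codon, reads only the given strand, and uses one-letter amino acid symbols. The name `translate_first_orf` and the test sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the 64 codons, ordered by first, second, then third
# base, each in the order T, C, A, G used in Table 1.4; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_first_orf(seq: str) -> str:
    """Translate from the first ATG up to, not including, the first in-frame stop.

    If no stop codon is reached, the partial translation is returned.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical sequence: ATG GCT TGG TAA flanked by two bases on each side.
print(translate_first_orf("CCATGGCTTGGTAACC"))  # MAW
```

Production tools usually scan all six reading frames and apply a minimum ORF length; this sketch keeps only the core idea.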

The Non-Coding Genomic Sequences

A large portion of the genome of a living organism has no known function. Around 98% of the human genome
does not code for protein. DNA that does not code for protein or has no known biological function is
sometimes called junk DNA, even though some of this non-coding DNA has been found to have a role in the regulation
of the expression of some genes [22]. Generally, DNA that does not code for protein is known as non-coding DNA. Most
non-coding DNA lies between genes on the chromosome, and it includes repetitive sequences, pseudogenes, and
telomeres.

Repetitive Sequences
Repetitive sequences are highly repeated non-coding DNA sequences that represent a large portion of
eukaryotic genomes (hundreds of thousands of copies per genome). They are categorized into three major
classes: simple repeated sequences, retrotransposons, and DNA transposons.

Simple Repeated Sequences Simple repeated sequences, also known as short tandem repeats (STRs) or
microsatellites, are a class of repeated sequences that consist of thousands of copies of short tandemly repeated
DNA sequences with a repetitive unit of one to six base pairs, forming repetitive sequences of up to hundreds of
nucleotides. In the following example, “AGGCT”, a unit of five nucleotides, is repeated in tandem six times.

AGGCT AGGCT AGGCT AGGCT AGGCT AGGCT

If the repeat unit is short (one to six base pairs), the sequence is called a microsatellite. A tandem repeat
with a longer repeat unit is called a minisatellite. Such simple repeated sequences are widely found in
prokaryotes and eukaryotes. They may account for about 5% of the total genomic DNA in humans and a varied
percentage in the genomes of other organisms [23].
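Short tandem repeats such as the AGGCT example can be located with a back-referencing regular expression. The Python sketch below is illustrative only: the function name, the default thresholds (unit of one to six bases, at least three copies), and the flanked test sequence are assumptions, and dedicated STR finders handle imperfect repeats that this pattern ignores.

```python
import re

def find_tandem_repeats(seq, min_unit=1, max_unit=6, min_copies=3):
    """Yield (start, unit, copies) for perfect tandem repeats of a short unit."""
    # ([ACGT]{a,b}?) captures the shortest possible unit; \1{n,} requires that the
    # same unit follows at least n more times in a row.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        yield match.start(), unit, len(match.group(0)) // len(unit)

example = "AGGCT" * 6
print(list(find_tandem_repeats(example)))                # [(0, 'AGGCT', 6)]
print(list(find_tandem_repeats("CC" + example + "GA")))  # [(2, 'AGGCT', 6)]
```

The same function applied with the unit TTAGGG would count telomere-style repeats.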

Retrotransposons Retrotransposons are transposable elements whose transposition in the genome is
mediated by reverse transcription of RNA to DNA. Some RNA molecules are converted to complementary
DNA (cDNA) by an enzyme called reverse transcriptase. The new DNA copy is integrated at a new site in the
genome. This class of repetitive sequences is a major contributor to genome size (45% of human genomic
DNA). Retrotransposons are classified into short interspersed elements (SINEs), long interspersed elements
(LINEs), and retrovirus-like elements [24, 25]. SINEs and LINEs are examples of transposable elements that
can move to different sites in genomic DNA. SINEs range from 100 to 700 bp in length; the human genome
contains about 1,500,000 copies, which make up around 13% of the genome. LINEs have about 850,000 copies
in the human genome, which make up around 21% of the genome. The retrovirus-like
elements are DNA sequences that resemble retrovirus DNA sequences. A retrovirus is a virus that uses
RNA as its genetic material and is characterized by inserting a DNA copy of its genome into the genomes of infected
cells. Analysis of genomic DNA reveals the presence of many thousands of retrovirus-like elements, which
suggests that some retrovirus genomes integrated into animal genomes and that the reverse transcription
of RNA to DNA played a major part in shaping the eukaryotic genome. The human retrovirus-like elements
range from approximately 2,000 to 10,000 bp in length. There are approximately 450,000 retrovirus-like
elements in the human genome (around 8% of human DNA) [26].

DNA Transposons DNA transposons are a class of interspersed repetitive elements that
move through the genome by being copied and reinserted as DNA sequences, rather than moving by reverse

FIGURE 1.16 Formation of pseudogene by reverse transcription.

transcription. In the human genome, there are about 300,000 copies of DNA transposons, ranging from 80 to
3000 bp in length, and accounting for approximately 3% of human DNA [27].

Pseudogenes
A gene is a genomic DNA sequence with a coding region and regulatory regions. Sequencing and annotation
of the human genome and the genomes of other organisms uncovered the presence of multiple copies
of some known genes that lack some or all of their regulatory region. Therefore, they are not functional
genes, and they are classified as non-coding sequence. These non-functional genes are called pseudogenes.
Some genes have multiple functional copies to produce more protein when needed, and some of these
copies may mutate and become pseudogenes. A pseudogene may also be formed when a functional gene is
transcribed into mRNA but, instead of being transported to the ribosome and translated into a protein
molecule, the mRNA is converted into a piece of double-stranded complementary DNA (cDNA) by enzymatic reverse
transcription (Figure 1.16). This cDNA is then integrated into the genome and becomes part of it as a
non-functional pseudogene [28]. Studies have shown that there are around 11,000 pseudogenes in the human
genome [29, 30].

Telomeres
A telomere (Figure 1.2b) is a repetitive nucleotide sequence of genomic DNA found at each end of a
eukaryotic chromosome and of the linear chromosomes of several bacteria, including Streptomyces, Borrelia, and
Rhodococcus. Telomeres act as caps that protect the ends of chromosomes from deterioration. They play critical
roles in chromosome replication and maintenance during cell division. Studies have shown that telomeres get
shorter each time a cell copies itself. The repeated sequences of telomeres contain clusters of guanine residues
(Gs). The sequence of telomere repeats in humans and other mammals is “TTAGGG”, which is repeated
hundreds or thousands of times and terminates with a 3` overhang of single-stranded DNA. Maintenance of
telomeres appears to be an important factor in determining the life span and reproductive capacity of cells.
Studies have shown that cancer cells have high levels of an enzyme called telomerase, which allows the cells to
maintain the ends of their chromosomes through indefinite divisions [31, 32].

RIBONUCLEIC ACID
Ribonucleic acid, or RNA, is a polymeric molecule produced from a gene by the process of gene transcription.
RNA is essential in almost all biological activities for its roles in coding for proteins (mRNA), transferring
amino acids (tRNA), translating codons into amino acids (rRNA), and regulating the expression of genes. An
RNA molecule is a single-stranded polymer that differs slightly from DNA. It contains the sugar ribose
instead of the deoxyribose found in the DNA molecule (Figure 1.17). The difference between ribose and

FIGURE 1.17 Ribose and deoxyribose.

FIGURE 1.18 Uracil.

FIGURE 1.19 RNA types.

deoxyribose is that ribose has a hydroxyl group bound to its second carbon atom (the 2` carbon), while deoxyribose
has only a hydrogen atom, without oxygen, at that position [33].
Moreover, the RNA molecule is made up of the same nucleotides as DNA except that thymine is replaced
by a similar nucleotide called uracil (U) (Figure 1.18). Thus, the RNA sequence is made up of A, C, G, and U.
The order of nucleotides in an RNA molecule or fragment is represented by a sequence that is composed of
the letters A, C, G, and U for adenine, cytosine, guanine, and uracil nucleotide, respectively. An RNA sequence
may look like:
CUCACUAAAGACAGAAUGAAUGUAGAAAAGGCUGAAUUCUGUAAU
RNA is always the product of a gene, as a result of gene transcription that is regulated by regulatory factors
and complex pathways to accommodate the biological activities of the cell. There are four types of RNA
molecules in living cells (Figure 1.19): messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA),
and microRNA (miRNA). Only mRNA is protein-coding RNA (4%), while the others are non-protein-coding
RNAs (96%).

Ribosomal RNA
Ribosomal RNA (rRNA) molecules are one of the major components of the protein-synthesizing
organelle called the ribosome in living cells. Growing mammalian cells may contain about ten million ribosomes.
rRNA molecules are transcribed from genes in the nucleus in eukaryotes or chromatin body in prokaryotes and exported to the

FIGURE 1.20 Ribosome.

cytoplasm to translate the genetic information carried by mRNA into proteins. There are several rRNA types,
named according to molecule size expressed as a sedimentation coefficient in Svedberg units (S). The Svedberg
unit is a non-SI unit of sedimentation rate in a test tube under the centrifugal force of an ultra-high-speed
centrifuge. One Svedberg unit (S) equals 10⁻¹³ seconds. The S value of an RNA molecule is determined by its
mass, density, and shape.
An rRNA molecule or ribosomal subunit may be given the name “nS”, where “n” is the number of Svedberg
units. Both prokaryotic and eukaryotic ribosomes are formed of a larger and smaller subunit (Figure 1.20).
Prokaryotic ribosomes are composed of 50S and 30S while eukaryotic ribosomes are composed of 60S and 40S
subunits. The two subunits come together during mRNA translation into a protein. Each ribosome subunit is
made of both rRNA and proteins [34].
In bacteria, the larger ribosomal subunit (50S) contains 5S rRNA (≈120 nucleotides), 23S rRNA (≈3000
nucleotides), and 31 proteins. The smaller subunit (30S) contains 16S rRNA (≈1500 nucleotides) and 21
proteins. Each of these rRNA molecules (5S rRNA, 23S rRNA, and 16S rRNA) has its own gene called the 5S
rRNA, 23S rRNA, and 16S rRNA gene, respectively. Ribosomal RNA genes are called ribosomal DNA (rDNA).
The 16S rRNA gene is used in phylogeny and bacterial identification [35].
In eukaryotes, the 60S or large ribosomal subunit has three rRNA molecules, 28S rRNA (4,800 nucleotides),
5.8S rRNA (160 nucleotides), and 5S rRNA (120 nucleotides), and 50 proteins. The 40S small ribosomal sub-
unit has only 18S rRNA (1,900 nucleotides) and 33 proteins. Each eukaryotic rRNA has its own coding gene.
The rRNA gene or rDNA has been studied in detail for its biological importance and repeated nature [36, 37].

Transfer RNA
Transfer RNA (tRNA) plays an important role in translation during protein synthesis in ribosomes. It translates
the message coded in the nucleotide sequence of mRNA into specific amino acids, which are joined together
to form a protein. Although tRNA is a single-stranded molecule, it can form double-stranded structures that
are important to its function in protein synthesis. Single strands of RNA can hybridize to form a
double-stranded molecule, and a single RNA molecule can fold over on itself to form secondary structures such as
hairpin loops stabilized by hydrogen bonds between complementary nucleotides [38]. Such base pairing of
RNA is critical for tRNA. The tRNA molecule has two important regions: the amino acid binding site, where
a specific amino acid is attached to the tRNA molecule, and a trinucleotide region in a hairpin loop called the
anticodon, which is complementary to a codon in mRNA at the time of translation (Figure 1.21). A codon consists
of three nucleotides on an mRNA strand that encode a specific amino acid. During translation (the passing of
mRNA between the two ribosome subunits), a codon on the mRNA pairs with the anticodon on a tRNA, and
the corresponding amino acid is released from the tRNA and added to the growing polypeptide chain.
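Codon and anticodon pairing follows the same complementarity rule as DNA, with uracil in place of thymine. The Python sketch below is a simplified illustration that assumes strict Watson-Crick pairing at all three positions (it ignores wobble pairing at the third position); the function name `anticodon` is a hypothetical choice.

```python
# RNA base pairing: A pairs with U, C pairs with G.
RNA_PAIRS = str.maketrans("ACGU", "UGCA")

def anticodon(codon: str) -> str:
    """Return the anticodon, written 5'->3', for an mRNA codon written 5'->3'."""
    # The two strands pair antiparallel, so complement first and then reverse.
    return codon.upper().translate(RNA_PAIRS)[::-1]

print(anticodon("AUG"))  # CAU (pairs with AUG as 3'-UAC-5')
```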

MicroRNA
MicroRNAs, or miRNAs, are small non-coding RNA molecules of about 22 nucleotides in length. They are
encoded in the genomes; some of them are transcribed from within protein-coding units (exons) of a gene
and others are transcribed from within non-coding regions of a gene (introns). The miRNAs generated from

FIGURE 1.21 Transfer RNA.

exons are called canonical miRNAs, while the ones generated from introns are called non-canonical miRNAs.
MiRNAs are produced by two ribonucleases (Drosha and Dicer). The precursors of the miRNAs may be long
RNAs that fold to form hairpin secondary structures, which are then cleaved sequentially by the ribonucleases
(Figure 1.22).
MiRNAs play an important regulatory role in gene expression by halting the translation of mRNA into protein
through a process called RNA interference (RNAi) or gene silencing. A miRNA strand is incorporated into the
RNA-induced silencing complex (RISC), which is a protein complex that uses the incorporated miRNA strand
to pair with the complementary target mRNA and to stimulate its degradation by a protein in RISC.
The miRNA targeting sites are usually at 3`-untranslated regions (UTR) of the targeted mRNA. A single
miRNA can target up to 100 different mRNAs. Therefore, miRNAs may regulate more than half of the protein-
coding genes in eukaryotes. MiRNAs have also been linked to some types of cancers and translocation muta-
tion in chromosomes [39, 40].

Messenger RNA
Messenger RNA (mRNA) is transcribed from a gene and synthesized in the nucleus using the nucleotide
sequence of the gene as a template. The transcribed mRNA is complementary to the nucleotide sequence of one
DNA strand of the gene. The process of mRNA transcription requires nucleotide triphosphates as substrates
and the enzyme RNA polymerase II as a catalyst. The transcribed mRNA of prokaryotes is different from that
of eukaryotes. The prokaryotic primary mRNA contains no introns, while the eukaryotic primary mRNA
contains introns that will be spliced out to keep only the exons to form the final mRNA. The mRNA is formed
in the nucleus and transported to the cytoplasm where it passes through the ribosome (between smaller and
larger ribosome subunits) and interacts with the tRNA so the transcript will be translated into amino acids and
the protein is assembled. In the following we will discuss both transcription and translation [41].

Messenger RNA Transcription

In eukaryotes, genes can be transcribed into RNA molecules. There may be thousands of genes in the genome of
an organism. For example, the human genome has around 20,000 protein-coding genes. However, not all genes are
transcribed at all times; some of them are transcribed only when they are needed. The process and intensity of gene
transcription into mRNA is known as gene expression. The set of all RNA transcripts, coding and non-coding, in an
individual cell or a population of cells is called the transcriptome. In complex organisms, the genes expressed in the
cells of each organ are different from others since each organ has different functions. The gene expression

FIGURE 1.22 Formation of microRNA.

distinguishes the cells of one organ from those of others, even though all somatic cells have the same genome and the same
set of genes. For example, the genes expressed in hepatocytes, or liver cells, are the genes that play pivotal roles
in metabolism, detoxification, protein synthesis, and innate immunity against invading microorganisms. The
set of genes expressed in liver cells will be different from the set of genes expressed in the cells of the lungs, kidney,
or brain. Only housekeeping genes are expressed in all cells, because they perform basic biochemical
activities that occur in all cells. Housekeeping genes are typically required for the maintenance of basic cellular
functions that are essential for the existence of a cell, regardless of its specific role in the tissue or organism. Thus,
they are expressed in all cells of an organism under normal and pathophysiological conditions, irrespective
of tissue type, developmental stage, cell cycle state, or external signal. Therefore, they are widely used as
internal controls in experimental studies such as polymerase chain reaction (PCR) experiments. The list of
housekeeping genes may vary from one organism to another. Some of the housekeeping genes of humans
and other animals include beta-actin (ACTB), glyceraldehyde-3-phosphate dehydrogenase (GAPDH),
phosphoglycerate kinase 1 (PGK1), peptidylprolyl isomerase A (PPIA), ribosomal protein P0 (RPLP0), acidic
ribosomal phosphoprotein P0 (ARBP), and others [42, 43].
Generally, the amounts and types of mRNA molecules in a cell reflect the biological activities of that cell.
Thousands of different mRNAs are produced every second in every cell. The expression of any gene is regulated
by a complex pathway to provide adequate amounts of protein. Since gene expression in any kind of
cell is affected by several factors, measuring the level of expression of all genes in a cell at a specific time
and under specific conditions, called gene profiling, serves as a snapshot or checkpoint in some genomic and cancer
research, and it is also used in the diagnosis of some diseases.

RNA molecules are initially synthesized as precursors, or pre-RNA, which must be processed by
splicing to remove the introns and join the exons together to form the
mature RNA. Splicing of pre-mRNA is therefore a major part of the process that results in synthesis of the
protein-coding component of the transcriptome, or the whole mRNA of the cell. Some eukaryotic genes can
code for different protein variants, or isoforms, through so-called alternative splicing, which is the
joining of exons in varied combinations to alter the sequence of the coded proteins. The protein
isoforms may have different functions or properties.
The steps of the mRNA transcription from a gene in eukaryotic cells include initialization of mRNA tran-
scription, transcript elongation, transcript termination, and post-transcriptional modification [44].

Initialization Stage Transcription of the mRNA is regulated by transcription factors (TFs), which are proteins that
bind to the promoter region of the gene. Many TFs have a ligand binding domain (LBD) and a DNA binding
domain (DBD). The LBD is the site that, when it binds an activating substance, activates the DBD of the TF to
bind to the DNA promoter region of the target gene. TFs are activated and deactivated by complex pathways.
Once a TF binds to the promoter region of a target gene, RNA polymerase II binds to the polymerase
binding sequence of the core promoter region and separates the double-stranded DNA of the gene into two
single strands, one of which is the template strand (anti-sense strand) for the transcription of
mRNA, while the other is the non-template strand (coding strand or sense strand) (Figure 1.23).

Elongation Stage Once the two DNA strands of the gene are separated, only one strand (the template strand)
will be the template for the RNA polymerase II. The enzyme will start to read one base at a time to build an
RNA strand out of complementary nucleotides. The complementary RNA strand grows from 5` end to 3` end
and will continue until all mRNA is transcribed (Figure 1.24). The RNA transcript will be identical to the non-
template strand except that thymine (T) will be replaced by uracil (U).
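
The relationship between the two DNA strands and the transcript can be sketched in a few lines of Python. This is only an illustration: the function names and the short example sequence are hypothetical, and all strands are written in the 5` to 3` direction.

```python
# Minimal sketch: the RNA transcript equals the coding (non-template)
# strand with T replaced by U. Names and the example are illustrative.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_strand(coding_strand: str) -> str:
    """Return the template (anti-sense) strand, written 5'->3'."""
    return "".join(DNA_COMPLEMENT[b] for b in reversed(coding_strand.upper()))


def transcribe(coding_strand: str) -> str:
    """Shortcut: copy the coding strand and replace T with U."""
    return coding_strand.upper().replace("T", "U")


def transcribe_from_template(template: str) -> str:
    """Build the transcript by base pairing against the template strand.

    The template is read 3'->5', so the RNA grows 5'->3'.
    """
    return "".join(RNA_PAIRING[b] for b in reversed(template.upper()))


coding = "ATGGCCTGA"
template = template_strand(coding)           # TCAGGCCAT
print(transcribe(coding))                    # AUGGCCUGA
print(transcribe_from_template(template))    # AUGGCCUGA
```

Both routes give the same transcript, which is why sequence databases usually report the coding strand.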

Termination Stage The mRNA strand will continue to grow in the above elongation stage until it reaches a
transcribed sequence that bends to form a hairpin structure called a terminator, which terminates the tran-
scription process (Figure 1.25). The pre-mRNA is then released from the RNA polymerase.

FIGURE 1.23 Transcription initialization.

FIGURE 1.24 Transcription elongation stage.

FIGURE 1.25 Transcription termination.

Posttranscriptional Modifications After transcription, the pre-mRNA undergoes some essential modifications
to be ready for the translation stage in ribosomes. The post-transcriptional modifications of the pre-mRNA
include splicing and 5`-end and 3`-end modifications (addition of a cap to the 5`-end, and addition of a
poly(A) tail to the 3`-end).

Gene Splicing Gene splicing is a post-transcriptional modification through which a single gene can code for multiple protein
variants or isoforms. There are several types of alternative splicing events, which can occur in the same
pre-mRNA. Exon skipping is the most common event, in which an exon or exons are included in
or excluded from the final transcript, leading to extended or shortened mRNA variants. Intron retention (IR) is
the event in which an intron (non-coding region) is retained in the final transcript. IR plays a role in
the regulation of gene expression and is associated with complex diseases. Alternative 3` splice site and alternative 5` splice
site selection are events in which a different 3` or 5` splice site is used, which changes the boundary of an exon.
Figure 1.26 shows the types of gene splicing [45].
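
As a rough illustration of exon skipping, the sketch below joins a list of exons into a transcript and optionally leaves some of them out. The exon sequences and the function name are made up for the example, and real splicing also depends on splice-site signals that are not modeled here.

```python
# Minimal sketch of exon skipping; exons and names are hypothetical.
def splice(exons, skipped=()):
    """Join exons in order, leaving out those whose index is in `skipped`."""
    return "".join(exon for i, exon in enumerate(exons) if i not in skipped)


exons = ["AUGGCC", "AAAUGG", "GGCUAA"]

full_length = splice(exons)             # all three exons included
isoform = splice(exons, skipped={1})    # middle exon skipped

print(full_length)  # AUGGCCAAAUGGGGCUAA
print(isoform)      # AUGGCCGGCUAA
```

The two transcripts come from the same gene but would be translated into proteins of different lengths.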

The 5`-End and 3`-End Modifications Both ends of the pre-mRNA are modified. A cap of 7-methylguanosine is
attached at the 5`-end of the pre-mRNA. This cap is needed to help initiate translation of the mRNA into a protein in
the ribosome.
A series of about 250 adenine nucleotides called poly(A) tail is attached to the 3`-end of the pre-mRNA. The
poly(A) synthesis is catalyzed by poly(A) polymerase.

FIGURE 1.26 Gene alternative splicing.

FIGURE 1.27 pre-mRNA modifications.

The structure of the mature functional mRNA consists of 5` cap, 5` untranslated region (5`UTR), coding
region called open reading frame (ORF), 3` untranslated region (3`UTR), and poly-A tail (Figure 1.27).

Five Prime Untranslated Region The 5`-untranslated region (5`-UTR) is the untranslated part of an mRNA
strand that extends from its 5`-terminus (cap site) to the translational start codon (AUG). The 5`-UTR
is about three to ten bases long in prokaryotic mRNA and varies from tens to thousands of
bases in eukaryotic mRNA (about 150 bases on average in human).

Three Prime Untranslated Region The 3`-untranslated region (3`-UTR) is the section of mRNA that immedi-
ately follows the translation termination codon (stop codon).

The Open Reading Frame The protein-coding region is known as the open reading frame (ORF). It consists of a
series of codons that specify the amino acid residues of the protein sequence that the gene codes for. A codon
consists of three consecutive nucleotides that code for a specific amino acid. There are 64 possible codons
(4 × 4 × 4), of which 61 code for amino acids and three are stop codons, but only 20 amino acids are used
for protein synthesis. Therefore, some amino acids are coded by more than one codon.

TABLE 1.5 Genetic Code [21]

                            Second Position
First Position    U      C      A      G      Third Position
U                 Phe    Ser    Tyr    Cys    U
                  Phe    Ser    Tyr    Cys    C
                  Leu    Ser    Stop   Stop   A
                  Leu    Ser    Stop   Trp    G
C                 Leu    Pro    His    Arg    U
                  Leu    Pro    His    Arg    C
                  Leu    Pro    Gln    Arg    A
                  Leu    Pro    Gln    Arg    G
A                 Ile    Thr    Asn    Ser    U
                  Ile    Thr    Asn    Ser    C
                  Ile    Thr    Lys    Arg    A
                  Met    Thr    Lys    Arg    G
G                 Val    Ala    Asp    Gly    U
                  Val    Ala    Asp    Gly    C
                  Val    Ala    Glu    Gly    A
                  Val    Ala    Glu    Gly    G

FIGURE 1.28 Possible open reading frames (ORFs)

Table 1.5 shows the codons and the amino acids. The first base of each codon is in the first column, the
second base is in the first row, and the third base is in the last column of the table. For example, UCU,
UCC, UCA, and UCG are codons for serine (Ser). The property by which a single amino acid is coded by
multiple codons is called the degeneracy of codons or the redundancy of the genetic code. The genetic code is
degenerate mainly at the third codon position, which reduces the effect of many mutations.
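
The degeneracy of the code can be checked directly by grouping the 64 codons by the amino acid they specify. The sketch below builds the standard genetic code as a Python dictionary using single-letter amino acid symbols, with `*` standing for a stop codon; the variable and function names are illustrative choices, not part of any library.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in U, C, A, G order for each codon position;
# "*" marks a stop codon. Names here are illustrative.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def synonymous_codons():
    """Group codons by the amino acid (or stop signal) they encode."""
    groups = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        groups[aa].append(codon)
    return dict(groups)


groups = synonymous_codons()
print(len(CODON_TABLE))   # 64
print(groups["S"])        # ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC']
print(groups["M"])        # ['AUG']
```

Serine is coded by six codons in total (the four UCN codons plus AGU and AGC), while methionine and tryptophan each have a single codon.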
The ORF usually (but not always) begins with the start codon AUG, and ends with a stop codon, which is any
one of UAA, UAG, or UGA. ORF scanning or ab initio gene prediction programs predict genes by searching
a DNA sequence for ORFs that begin with an ATG and end with a termination triplet. The gene scanning is
complicated by the fact that each DNA sequence has six possible reading frames; three possible frames are
in one direction and three frames are in the reverse direction on the complementary strand (Figure 1.28).
Computers are quite capable of scanning all six reading frames for ORFs.
Given the coding region of a gene, we can use the above information to predict the mRNA sequence and
the ORFs.
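
A simple six-frame ORF scan can be sketched as follows. This is a simplified illustration rather than a real gene prediction program: it assumes that an ORF runs from an ATG to the first in-frame stop codon, it ignores introns, and the function names, the `min_codons` parameter, and the example sequence are hypothetical.

```python
# Minimal six-frame ORF scan on a DNA sequence (illustrative sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna: str) -> str:
    """Return the complementary strand written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 2):
    """Return (strand, frame, orf) tuples for ATG...stop ORFs in all six frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, seq[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGGCCTAAGG"))  # [('+', 2, 'ATGGCCTAA')]
```

In the example, the only complete ORF lies in the third forward frame; the ATG found on the reverse strand has no in-frame stop codon and is therefore not reported.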

Messenger RNA Translation

Transcription and RNA processing are followed by translation and the synthesis of proteins as directed by
the mRNA template. The mRNAs are read in the 5` to 3` direction, and polypeptide chains are synthesized
by translating the ORF. Each amino acid is specified by three bases or a codon in the ORF, according to a
nearly universal genetic code. Translation is carried out on ribosomes, with tRNA strands serving as adaptors
between the mRNA template and the amino acids being incorporated into the protein. Protein synthesis thus

FIGURE 1.29 Translation in ribosome.

FIGURE 1.30 General chemical composition of amino acids.

involves interactions among three types of RNA molecules (mRNA, tRNA, and rRNA), as well as various
proteins that are required for translation.
During the translation of an mRNA into a protein, the amino acids are aligned with their corresponding
codons on the mRNA template. A tRNA is approximately 75 nucleotides long; it forms complementary
base pairs between different regions of the molecule and folds into an L-shape to fit onto the ribosome during
translation (Figure 1.29). Each tRNA strand has the sequence CCA at its 3` terminus, where
an amino acid is covalently attached to the ribose of the terminal adenosine (A). The codon on the mRNA
template is recognized by the anticodon loop located at the other end of the folded tRNA through
complementary base pairing. The attachment of each amino acid to its specific tRNA is mediated by a specific enzyme
called an aminoacyl-tRNA synthetase; there is one such enzyme for each amino acid. When a codon pairs with an
anticodon, the amino acid attached to the tRNA is transferred to the growing polypeptide chain.
Translation continues until the entire ORF is translated and the stop codon is reached. Translation
is carried out simultaneously by millions of ribosomes in the cell [46].
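
The reading of codons during translation can be imitated with a short Python sketch. It assumes the standard genetic code and, as a simplification, that translation starts at the first AUG in the sequence and stops at the first in-frame stop codon; the function name and the example mRNA are hypothetical.

```python
from itertools import product

# Standard genetic code (single-letter amino acids, "*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna: str) -> str:
    """Translate an mRNA from its first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # stop codon reached
            break
        protein.append(amino_acid)
    return "".join(protein)


# 5'-UTR "GG", then AUG GCC AAA UGG, then the stop codon UAA.
print(translate("GGAUGGCCAAAUGGUAAGC"))  # MAKW
```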

THE PROTEINS
Proteins are the biomolecules that are translated from gene transcripts, or mRNAs, as discussed above. Any
protein must be coded by a gene. However, a gene may code for multiple protein isoforms that differ in sequence
and function. Protein molecules are built from a set of 20 amino acids, which are the building blocks of all proteins.
Amino acids are small organic molecules that consist of an alpha (central) carbon atom linked to an amino
group (NH2), a carboxyl group (COOH), a hydrogen atom (H), and a variable component called a side chain,
given the symbol R (Figure 1.30) [47].
In a protein molecule, multiple amino acids are linked together by peptide bonds, thereby forming a
long chain or polypeptide. A peptide bond is formed by a condensation reaction that releases a water
molecule as it joins the carboxyl group of one amino acid to the amino group of the next amino acid.
The linear sequence of amino acids within a protein is considered the primary structure of the protein
(Figure 1.31).
Each amino acid has a unique side chain (R), whose chemistry differs from that of the other amino
acids. Many amino acids have non-polar side chains, which have pure hydrocarbon alkyl groups (alkane

FIGURE 1.31 Formation of peptide bond between amino acids.

FIGURE 1.32 Non-polar side chains.

branches) or aromatic rings. Amino acids with non-polar side chains include glycine (G), alanine
(A), valine (V), leucine (L), isoleucine (I), methionine (M), phenylalanine (F), tryptophan (W), and proline
(P) (Figure 1.32).
Other amino acids have polar but uncharged side chains. These polar amino acids include serine (S), threo-
nine (T), cysteine (C), tyrosine (Y), asparagine (N), and glutamine (Q) (Figure 1.33).
Other amino acids have side chains with positive (basic amino acids) or negative charges (acidic amino
acids). Those amino acids that have electrically charged side chains include aspartate (D), glutamate (E), lysine
(K), arginine (R), and histidine (H) (Figure 1.34).
The physicochemical properties of the side chains of amino acids are critical to protein structures
because these side chains can bind with one another to fold the linear protein residues in a certain shape or
conformation forming the so-called three-dimensional structure (3D) of protein. Any protein has a specific
three-dimensional structure that determines its biological functions in the living cells. Charged amino
acid side chains can form ionic bonds, and polar amino acids can form hydrogen bonds. Hydrophobic
side chains interact with each other via weak Van der Waals interactions. Most bonds formed by these
side chains are non-covalent. Only cysteines can form covalent bonds with their side chains that contain

FIGURE 1.33 Amino acids with polar uncharged side chains.

FIGURE 1.34 Amino acids with electrically charged side chains.

FIGURE 1.35 Disulfide bridges between two cysteine amino acids.

sulfur. If a sulfur atom is bonded to a sulfur atom of another cysteine, a covalent disulfide bridge is formed
(Figure 1.35).
The order of amino acids in a protein or a polypeptide is represented by a sequence formed from single-
letter abbreviations of the amino acid forming the molecule. Table 1.5 contains the names of the 20 amino
acids and their three-letter and single-letter abbreviations.
In bioinformatics, the order of amino acids (linked to one another by peptide bonds) in a protein molecule
is represented by a linear sequence made up of the single-letter symbols of the 20 amino acids, as follows:

LTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQK
LPCSENPRDTEDVPWITLNSSIQK

TABLE 1.5 Amino Acids and Their Symbols

# Amino Acid Three-Letter Symbol Single-Letter Symbol

1 Alanine Ala A
2 Arginine Arg R
3 Asparagine Asn N
4 Aspartate Asp D
5 Cysteine Cys C
6 Glutamine Gln Q
7 Glutamate Glu E
8 Glycine Gly G
9 Histidine His H
10 Isoleucine Ile I
11 Leucine Leu L
12 Lysine Lys K
13 Methionine Met M
14 Phenylalanine Phe F
15 Proline Pro P
16 Serine Ser S
17 Threonine Thr T
18 Tryptophan Trp W
19 Tyrosine Tyr Y
20 Valine Val V
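
The symbol table and the side-chain groups described earlier map naturally onto Python dictionaries. The sketch below converts a single-letter sequence into three-letter symbols and counts residues by side-chain class. The dictionary and function names are illustrative, and the grouping follows the four classes used in this chapter (other classification schemes exist).

```python
from collections import Counter

# Single-letter to three-letter symbols for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# Side-chain classes as grouped in this chapter.
SIDE_CHAIN_CLASS = {}
for letters, label in (("GAVLIMFWP", "nonpolar"), ("STCYNQ", "polar"),
                       ("DE", "acidic"), ("KRH", "basic")):
    for letter in letters:
        SIDE_CHAIN_CLASS[letter] = label


def to_three_letter(sequence: str) -> str:
    """Convert a single-letter protein sequence to three-letter symbols."""
    return "-".join(ONE_TO_THREE[aa] for aa in sequence.upper())


def side_chain_composition(sequence: str) -> dict:
    """Count residues in each side-chain class."""
    return dict(Counter(SIDE_CHAIN_CLASS[aa] for aa in sequence.upper()))


fragment = "LTKDRMNVEK"  # first ten residues of the example sequence above
print(to_three_letter(fragment))
# Leu-Thr-Lys-Asp-Arg-Met-Asn-Val-Glu-Lys
print(side_chain_composition(fragment))
# {'nonpolar': 3, 'polar': 2, 'basic': 3, 'acidic': 2}
```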

Three-Dimensional Protein Structure

The order of the amino acids in a protein molecule in the sequence above represents a linear structure of
a protein molecule, in which the amino acids are linked by peptide bonds. However, the side chains of the
amino acids may link to one another forming the three-dimensional structure of a protein. Scientific studies
found that the order of the amino acids in a protein sequence (the linear structure) determines the final three-
dimensional shape of the protein or protein conformation. Even though a protein has an astronomical number
of possible conformations (Levinthal’s paradox), a specific conformation is adopted once the protein has been
synthesized [48]. Research has shown that a protein can be denatured by treating with certain solvents, which
destroy the non-covalent interactions (between the side chains) that hold the folded chain together. This
treatment converts the protein into a polypeptide chain that has lost its natural three-dimensional shape.
When the denaturing solvent is removed, the protein often refolds spontaneously, or re-natures, into its ori-
ginal conformation, proving that all the information needed for the specific conformation of a protein is
contained in its linear amino acid sequence. There are four levels of protein structure: primary, secondary,
tertiary, and quaternary [44, 49].

Primary Structure The primary structure of a protein molecule is the linear sequence of amino acids in a
polypeptide chain. The only links between the amino acid residues are the peptide bonds that create
the chain (Figure 1.36). The sequence of a protein represents its primary structure.

Secondary Structure The secondary structure of a protein molecule arises when atoms of the polypeptide backbone
form hydrogen bonds with one another. Thus, regions of the polypeptide fold due to interactions between
backbone atoms rather than side chains. The arrangement and folding of the polypeptide chain may form two secondary
structures: the alpha helix and the beta sheet.
The alpha helix (α-helix) is a spiral configuration of the polypeptide chain, which can in principle be
either right-handed or left-handed; nearly all alpha helices in proteins are right-handed. In an alpha helix, the carbonyl
(C=O) of one amino acid forms a hydrogen bond with the amino group (N-H) of the amino acid four residues further along the chain (e.g., the
carbonyl of amino acid 1 forms a hydrogen bond with the N-H of amino acid 5). Such a pattern of hydrogen

FIGURE 1.36 Primary structure of protein.

FIGURE 1.37 Secondary structures of protein (a) alpha helix and (b) beta sheet.

bonds shapes the polypeptide chain into a helical ribbon with each turn of the helix containing 3.6 amino
acids. The side chains or R groups of the amino acids project outward from the helix (Figure 1.37a).
The beta sheet (β-sheet) is made of two or more adjacent segments (beta strands) of the polypeptide chain, forming
a sheet-like structure stabilized by hydrogen bonds (Figure 1.37b). The hydrogen bonds are formed between
the carbonyl groups and amino groups of the backbone of the polypeptide, while the side chains extend above
and below the plane of the sheet. The strands forming the beta sheet can be parallel, pointing in the same
direction (N- and C-termini match), or anti-parallel, pointing in opposite directions (the N-terminus of one
strand is positioned next to the C-terminus of the other).

Tertiary Structure The tertiary structure of a protein molecule is the level of structure at which an entire
polypeptide folds to form a three-dimensional shape. It is primarily due to interactions between the side chains
(R groups) of the amino acids that make up the polypeptide. The side chain interactions that contribute to the
tertiary structure include hydrogen bonding, ionic bonding, dipole-dipole interactions, and hydrophobic interactions.
Moreover, the covalent disulfide bonds between the sulfur atoms in the side chains of cysteines in the
polypeptide contribute to the tertiary structure of proteins (Figure 1.38).

FIGURE 1.38 Tertiary structure of protein.

FIGURE 1.39 Quaternary structure of protein.

Quaternary Structure Some proteins are formed of multiple polypeptide chains or subunits. The quaternary
structure of a protein is formed by the association of several polypeptide chains or subunits into a closely
packed protein arrangement. The subunits are held together by hydrogen bonds and van der Waals forces
between non-polar side chains of the amino acids. Each of the subunits has its own primary, secondary,
and tertiary structure (Figure 1.39).

Protein Structure Representation and File Formats

Experimentally, the three-dimensional structures of proteins are solved by techniques such as X-ray crys-
tallography and nuclear magnetic resonance (NMR). X-ray crystallography is a technique used for deter-
mining the atomic and molecular structure of a crystal by diffraction of X-rays and measuring the angles
and intensities of these diffracted beams. Thus, the three-dimensional image of the density of electrons
within the crystal is captured. The image will determine the positions of the atoms in the crystal and their
chemical bonds. NMR is a spectroscopic technique that can also be used to solve the structure of proteins
by placing a protein sample in a very powerful superconducting magnet and measuring the radio-frequency
signals absorbed and re-emitted by its atomic nuclei.
The three-dimensional structures of many proteins or portions of proteins have been determined using
either X-ray crystallography or NMR. The solved structures of proteins or fragments of proteins are deposited
by scientists into databases for protein structures such as Protein Data Bank (PDB), which is available at
www.rcsb.org/. The PDB is a freely accessible repository containing thousands of solved structures of proteins
and other biomolecules. The PDB stores the information generated by the solved three-dimensional shape
of a protein in the PDB file format, which was invented in 1976 as a human-readable file to store protein
coordinates. It has a fixed-column width format limited to 80 columns. The PDB file format has undergone
several revisions over the years; the latest revision was on November 21, 2012. However, the PDB file format is
no longer being modified or extended to support new content. Instead, the PDBx/mmCIF format is used [50].

PDB File Format

The PDB file for a protein contains detailed information about the protein structure, including the position of
each atom in three-dimensional space and its connectivity [51]. Each line of information in the file is called
a record. A PDB file may contain several different types of records, arranged in a specific order to describe
a structure. The most used records include ATOM, HETATM, TER, HELIX, SHEET, and SSBOND. Each
ATOM record describes the position of one atom of a standard amino acid or nucleotide residue using atomic coordinates (XYZ
coordinates). A HETATM record describes the position (XYZ coordinates) of one atom of a non-standard residue such as
an inhibitor, cofactor, ion, or solvent molecule. A TER record indicates the end of a chain of residues. A HELIX record
describes the location and type of a helix; each helix has a single record. A SHEET record describes the location
and direction of each strand of a sheet. An SSBOND record defines a disulfide bond linkage between cysteine residues.
More details about the PDB file format are available at www.wwpdb.org/documentation/file-format. Figure 1.40
shows examples of ATOM and HETATM records that describe the positions of the atoms forming amino acids
and ligands, respectively. The columns from left to right are the record type, atom number, atom identity,
residue identity, residue number in the sequence, coordinates X, Y, and Z, occupancy, temperature factor, and
element symbol.
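
Because the PDB format uses fixed column positions, an ATOM or HETATM record can be read by slicing the line at the documented columns. The sketch below follows the column layout of the wwPDB format description; the class and function names are hypothetical, and the sample line is an invented record assembled piece by piece so that each field lands in its correct columns.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    record: str
    serial: int
    atom_name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float
    element: str


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line by its fixed columns (0-based Python slices)."""
    line = line.ljust(80)
    return AtomRecord(
        record=line[0:6].strip(),        # columns 1-6
        serial=int(line[6:11]),          # columns 7-11
        atom_name=line[12:16].strip(),   # columns 13-16
        res_name=line[17:20].strip(),    # columns 18-20
        chain_id=line[21],               # column 22
        res_seq=int(line[22:26]),        # columns 23-26
        x=float(line[30:38]),            # columns 31-38
        y=float(line[38:46]),            # columns 39-46
        z=float(line[46:54]),            # columns 47-54
        occupancy=float(line[54:60]),    # columns 55-60
        temp_factor=float(line[60:66]),  # columns 61-66
        element=line[76:78].strip(),     # columns 77-78
    )


# Invented example record, built field by field (values are not from a real entry).
SAMPLE_LINE = (
    "ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" "    "
    "  27.340" "  24.430" "   2.614" "  1.00" "  9.67" "          " " N"
)

atom = parse_atom_line(SAMPLE_LINE)
print(atom.atom_name, atom.res_name, atom.chain_id, atom.res_seq)  # N MET A 1
print(atom.x, atom.y, atom.z)                                      # 27.34 24.43 2.614
```

Splitting on whitespace is tempting but unreliable for this format, since adjacent fields may touch when values are wide; slicing by column avoids that problem.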

FIGURE 1.40 ATOM and HETATM records of a PDB file.

PDBx or mmCIF File Format

The PDBx/mmCIF format became the standard PDB archive format and an alternative to the PDB format in 2014.
It is now the default format used by the PDB. The name mmCIF stands for Macromolecular Crystallographic
Information File. The PDBx/mmCIF format is closely related to the Crystallographic Information File (CIF),
which is a standard text file format for representing crystallographic information [52]. It explicitly documents
all relationships between common data items such as atom and residue identifiers. This permits software
applications to evaluate and validate referential integrity with any PDB entry, and maps information between
the residue sequences of the experimental sample and the model coordinates.
The same information in the PDB file format is now available in the PDBx/mmCIF format. All data
items in the PDBx/mmCIF format are identified by a name that begins with the underscore character. Each
data item may consist of a category name and an attribute name. The category name is separated from the
attribute name by a period. This combination of category and attribute (e.g., _atom_site.id) is called an
mmCIF token. Data categories are presented in either key-value or tabular style. Figure 1.41 shows an example of
the key-value style.
The tabular style is used when there are multiple values for each token. In the tabular style, a "loop_" token
is followed by rows of data item names and then white-space delimited data values. Figure 1.42 shows an
example for the tabular style.
Each data item name after "loop_" token and starting with "_atom_site" corresponds to the data value in
the data line starting with ATOM. For example, the item names _atom_site.Cartn_x, _atom_site.Cartn_y, and
_atom_site.Cartn_z correspond to the 11th, 12th, and 13th data items (atomic coordinates). The list of data
items is then looped through for each line of data values.
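
The way item names are paired with data values in the tabular style can be sketched with a very small reader. This is a simplified illustration, not a full mmCIF parser: it assumes a single "loop_" block whose values contain no quoted strings or embedded spaces, the function name is hypothetical, and the sample block uses a reduced set of _atom_site items with invented values.

```python
def parse_loop(text: str):
    """Return one dict per data line, keyed by the item names after "loop_"."""
    names, rows = [], []
    in_loop = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "loop_":
            in_loop = True
        elif not in_loop or not line:
            continue
        elif line.startswith("#"):        # "#" closes the category
            break
        elif line.startswith("_"):        # item name, e.g. _atom_site.id
            names.append(line)
        else:                             # white-space delimited data values
            rows.append(dict(zip(names, line.split())))
    return rows


# Reduced, invented example of an _atom_site loop.
SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N N 27.340 24.430 2.614
ATOM 2 C CA 26.266 25.413 2.842
#
"""

rows = parse_loop(SAMPLE)
print(rows[1]["_atom_site.label_atom_id"])   # CA
print(float(rows[0]["_atom_site.Cartn_x"]))  # 27.34
```

Because each value is looked up by its item name rather than by column position, the reader keeps working when an entry lists the items in a different order or includes additional ones.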

Visualizing Protein Structure

Once a protein structure has been solved and the structure information has been stored in a file, the three-
dimensional structure of that protein can be visualized using protein molecular graphic software. There
are numerous computer graphics programs that have been developed for visualizing and manipulating
complicated three-dimensional structures of biomolecules. Computer visualization programs allow users to
visually manipulate the structural images through a graphical user interface (GUI). A user can move, rotate,
and zoom in or out an atomic model on a computer screen in real time, or examine any portion of the struc-
ture in detail. Examples of molecular visualization programs include the VMD (Visual Molecular Dynamics)
and PyMol. These programs are designed for modeling, visualization, and analysis of biomolecules. They can
read the PDB and PDBx/mmCIF file formats and display the contained structure. They can also provide a wide
variety of methods for rendering and coloring a protein molecule. The VMD is available at www.ks.uiuc.edu/
Research/vmd/ as an open-source program. It requires registration for downloading and installation. PyMol
is available at https://pymol.org and it has different licenses including free licenses for students and teachers.
The PDB or PDBx/mmCIF file of a protein can be downloaded from the PDB website and visualized on any
of these programs. Figure 1.43 shows PyMol visualizing the structure of the ligand binding domain of PPAR
Gamma (PDB ID: 2HFP).

FIGURE 1.41 Key-value style of PDBx/mmCIF format.

FIGURE 1.42 Tabular style of PDBx/mmCIF format.

FIGURE 1.43 PyMol visualizing ligand binding domain of PPAR gamma.

GENE MUTATIONS
A gene mutation is the change that takes place in the sequence of a gene, and it contributes to diversity across
organisms and may have widely differing consequences. For mutations to be heritable, they must occur in
germline cells, the cells that produce the next generation. Such mutations are called germline mutations. On the
other hand, somatic mutations (which occur in somatic cells) are not heritable, and they affect only the present
organism’s body. Somatic mutations therefore have no direct significance for evolution. Evolutionary theory is mostly
interested in germline mutations. Mutations may occur randomly, and the consequence may be harmful,
useful, or have no effect at all. The beneficial gene changes may be conserved for a long term and passed to
offspring. Generally, mutation rates are usually very low, and biological systems go to extraordinary lengths
to keep them as low as possible, mostly because many mutational effects are harmful. Moreover, DNA repair
or proofreading during DNA replication reduces the mutation rates. Mutations can also be due to the natural
exposure of an organism to certain environmental factors, such as radiation or chemical carcinogens.
DNA mutations can occur in several ways. They have varying effects on the living organism, depending on
where they occur and whether they alter the function of essential proteins. When the change involves only one
nucleotide position in the DNA sequence, it is called a point mutation. DNA mutations can arise by
substitution, insertion, or deletion. In a substitution mutation, a nucleotide is replaced by one of the other three
nucleotides. A single-position substitution that is common in a population is known as a single nucleotide polymorphism
(SNP). A substitution mutation can be silent, missense, or nonsense. In a silent substitution mutation, the
replacement does not change the coded amino acid (synonymous codon). For example, both codons GCA and GCC
code for alanine amino acid (Ala), therefore replacing adenine (A) in GCA with cytosine (C) does not change
the translated amino acid (Ala) and it has no effect on the protein. In missense substitution mutation, the
nucleotide replacement will result in a codon that codes for a different amino acid (nonsynonymous codon).
For example, the codon GGA codes for glycine (Gly), so substituting the first guanine (G) with adenine
(A) will result in AGA codon, which codes for arginine amino acid (Arg). Therefore, it has a consequence in
the protein translation. In a nonsense substitution mutation, the nucleotide substitution will result in a stop
codon that terminates the protein translation before completion and the consequence will be a shorter protein
that does not function properly. For example, the codon CAA codes for glutamine (Gln). If the cytosine (C) in
CAA (in a protein-coding region) is replaced by thymine (T), the resulting codon will be TAA (UAA in the mRNA), which
is a stop codon. During translation in the ribosome, protein synthesis will be terminated when UAA is
reached. Figure 1.44 shows the three types of substitution.
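
The three outcomes of a substitution can be determined by comparing the amino acid coded before and after the change. The following sketch uses the standard genetic code and reproduces the three examples given above; the function name and its arguments are illustrative, and a substitution within an existing stop codon is not treated as a separate case.

```python
from itertools import product

# Standard genetic code (single-letter amino acids, "*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base substitution in a codon (position is 0, 1, or 2)."""
    codon = codon.upper().replace("T", "U")       # accept DNA or RNA letters
    new_base = new_base.upper().replace("T", "U")
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"


print(classify_substitution("GCA", 2, "C"))  # silent   (Ala -> Ala)
print(classify_substitution("GGA", 0, "A"))  # missense (Gly -> Arg)
print(classify_substitution("CAA", 0, "T"))  # nonsense (Gln -> stop)
```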
In an insertion mutation, a nucleotide or several nucleotides are inserted into the DNA sequence of a gene, while
in a deletion mutation a nucleotide or several nucleotides are removed from the DNA sequence. Deletion and
insertion mutations cause a frameshift when the number of inserted or deleted nucleotides is not a multiple of
three, because this shifts the open reading frame (ORF) of the gene downstream of the mutation. The resulting
protein is usually nonfunctional.
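
The effect of a frameshift is easy to see by splitting a sequence into codons before and after an insertion. In the sketch below, the sequence, the inserted base, and the function names are invented for illustration.

```python
def codons(dna: str):
    """Split a sequence into complete codons, starting at the first base."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


def is_frameshift(length_change: int) -> bool:
    """An insertion (+n) or deletion (-n) shifts the frame unless n is a multiple of 3."""
    return length_change % 3 != 0


original = "ATGGCCAAATGGTAA"
mutant = original[:3] + "C" + original[3:]   # insert one base after the start codon

print(codons(original))  # ['ATG', 'GCC', 'AAA', 'TGG', 'TAA']
print(codons(mutant))    # ['ATG', 'CGC', 'CAA', 'ATG', 'GTA']
print(is_frameshift(1), is_frameshift(-2), is_frameshift(3))  # True True False
```

After the single-base insertion, every codon downstream of the start codon changes and the original stop codon is no longer read in frame.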

FIGURE 1.44 Substitution mutation and consequences.

VIRUS GENOME
Viruses are not living organisms; they lack the characteristics that living organisms have. Living organisms have
definite cells that contain organelles and a metabolism system, and they can provide themselves with energy.
Viruses do not have such ability; however, they can replicate but only inside living cells and can adapt to their
environment and mutate. A virus is therefore defined as a small collection of genetic
material, either DNA or RNA, surrounded by a protein coat called a capsid. The genome together with its capsid is
called the nucleocapsid. Many viruses, but not all of them, are also surrounded by a membrane called the envelope,
which encloses the capsid. The virus envelope is usually acquired from the nuclear or cell membrane of the
infected host cell (Figure 1.45) [33].
Viruses are classified based on the Baltimore classification system, which depends on the combination of
their nucleic acid (DNA or RNA), nature of the strand (single-stranded or double-stranded), sense (positive-
sense or negative-sense), and method of replication. The virus genome can be double-stranded DNA (dsDNA),
single-stranded DNA (ssDNA), double-stranded RNA (dsRNA), or single-stranded RNA (ssRNA). The latter
can either be positive-sense single-stranded RNA (+ssRNA), or negative-sense single-stranded RNA (-ssRNA).
An RNA strand is positive-sense if its own sequence is translated or translatable into protein,
and negative-sense if the complementary RNA sequence, and not the original
strand itself, is translated or translatable into protein [53]. The above criteria are used to classify viruses into
several families such as Adenoviridae (dsDNA), Coronaviridae (ssRNA), Retroviridae (ssRNA), Parvoviridae
(ssDNA), Poxviridae (dsDNA), etc.
Viruses proliferate by infecting cells and hijacking their protein synthesis machinery to replicate, producing
multiple virions that are ready to infect new cells. A virion is a complete virus particle with all virus components.
During infection, a virion first attaches to receptors on the membrane of a living cell. The receptors can be
proteins, carbohydrates, or lipids. Once a virus is attached to a receptor it can enter the cell using one of two
routes: endocytosis or non-endocytosis. In the endocytic route, the virus is coated with a clathrin protein and
then transported in a vacuole into the cell. In non-endocytic route, the virus crosses the plasma membrane at
neutral pH. Some viruses use both entry routes. Viruses such as retroviruses can also enter cells through cell-
to-cell transmission [54].
Once the virus enters the host cell it will start to replicate. The replication depends on the viral genome
(dsDNA, ssDNA, dsRNA, +ssRNA, -ssRNA, etc.). Transcription and translation are similar to the processes discussed
above; however, viruses use their own genomes while exploiting the protein synthesis machinery of the host
cell to synthesize their viral enzymes and capsid proteins and to assemble their new viral genomes. DNA

FIGURE 1.45 Virus particle.

viruses usually use the host cell proteins and enzymes to make additional DNA that is transcribed into mRNA
translated into proteins. RNA viruses use the RNA strand as the template for mRNA, which is translated into
proteins, and viral genomic RNA for assembling new virions. The RNA-virus Retroviruses, such as HIV, must
be reverse-transcribed into DNA using reverse transcription process, which then is incorporated into the host
cell genome. The process of reverse transcription is mediated by an enzyme known as reverse transcriptase.
This enzyme is used at the laboratory to convert RNA into DNA called complementary DNA (cDNA) [55].
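At the sequence level, the RNA-to-cDNA conversion described above amounts to reading the RNA template and writing the complementary DNA bases in antiparallel order. The following Python sketch illustrates only that base-pairing rule; it does not model the enzyme itself, and the function name `reverse_transcribe` and the example sequence are hypothetical.

```python
# Pairing rules for copying an RNA template into DNA (A-T, U-A, G-C, C-G).
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return the first-strand cDNA of an RNA sequence, written 5' to 3'."""
    rna = rna.upper()
    unknown = set(rna) - set(RNA_TO_DNA)
    if unknown:
        raise ValueError(f"unexpected RNA symbols: {sorted(unknown)}")
    # The cDNA strand is antiparallel to its template, so the template is
    # read in reverse while each base is replaced by its DNA partner.
    return "".join(RNA_TO_DNA[base] for base in reversed(rna))


example_rna = "AUGGCCAUU"  # hypothetical sequence, for illustration only
cdna = reverse_transcribe(example_rna)
print(cdna)  # AATGGCCAT
```

The input check rejects DNA letters such as T, which makes it harder to pass a DNA sequence by mistake.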
Following the synthesis of viral proteins and genomes, the genomes are encapsulated with capsid
protein to form new virions, which are then released from the host cell. The number of virions released from
a single infected cell is called the viral burst size. The virions can infect adjacent cells and repeat the
replication cycle. Some viruses are released after the death of the host cell, and others leave the infected cell by
passing through the cell membrane without killing it.
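The effect of burst size on repeated replication cycles can be illustrated with simple arithmetic. The sketch below rests on a deliberately idealized assumption that is not made in the text: every released virion infects one new cell, and every infected cell releases the same number of virions. The function name and the numbers are hypothetical.

```python
def virions_after_cycles(burst_size, cycles, initial_virions=1):
    """Virions present after a number of replication cycles.

    Idealized model: each virion infects exactly one cell, and each
    infected cell releases `burst_size` new virions per cycle.
    """
    if burst_size < 0 or cycles < 0 or initial_virions < 0:
        raise ValueError("arguments must be non-negative")
    return initial_virions * burst_size ** cycles


# One virion, a hypothetical burst size of 100, three cycles.
print(virions_after_cycles(100, 3))  # 1000000
```

Real infections fall well short of this figure, because many virions never reach a susceptible cell, but the calculation shows why even a modest burst size leads to rapid spread.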
Many viruses cause disease; some of these diseases are severe, highly contagious, and deadly. In past decades,
the world has faced several viral disease outbreaks, such as Ebola, SARS, and Zika, which have taken
a heavy toll on lives and economies. More recently, the world has been struck by the outbreak of COVID-19,
which is caused by a coronavirus called SARS-CoV-2. To date, the world is still trying to control its
spread and reduce its impact.
SARS-CoV-2 infection starts with the attachment stage, in which the viral spike protein binds to
receptors on the cell membrane. The coronavirus uses the non-endocytic route to enter the host cell. The viral
RNA is translated into large replicase polyproteins (pp1a and pp1ab), which are subsequently cleaved into viral
nonstructural proteins (NSPs). Full-length antisense single-stranded RNA (-ssRNA) copies of the coronavirus
genome are produced by the replicase enzyme using the full +ssRNA viral genome as a template. The
spike protein, envelope protein, membrane protein, and nucleocapsid protein are translated from segments of the
mRNA and are used in the assembly of new virions in the Golgi body and endoplasmic reticulum of the host cell.
Once virions have been assembled, they are transported in vacuoles to the extracellular space, where they are
released to infect new cells [56].

REFERENCES
1. Cochrane, G., et al., The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res, 2011. 39
(Database issue): pp. D15–18.
2. Hetzer, M. and G. Cavalli, Eukaryotic cells. Curr Opin Cell Biol, 2011. 23(3): pp. 255–257.
3. Souza, W., Prokaryotic cells: structural organisation of the cytoskeleton and organelles. Mem Inst Oswaldo Cruz,
2012. 107(3): pp. 283–293.
4. Benbow, R.M., Chromosome structures. Sci Prog, 1992. 76(301-302 Pt 3–4): pp. 425–450.
5. Schreck, R.R. and C.M. Disteche, Chromosome banding techniques. Curr Protoc Hum Genet, 2001.
Chapter 4: Unit 4.2.
6. Bolcun-Filas, E. and M.A. Handel, Meiosis: the chromosomal foundation of reproduction. Biol Reprod, 2018.
99(1): pp. 112–126.
7. Garbers, D.L., Molecular basis of fertilization. Annu Rev Biochem, 1989. 58: pp. 719–742.
8. Georgadaki, K., et al., The molecular basis of fertilization (Review). Int J Mol Med, 2016. 38(4): pp. 979–986.
9. Vaillancourt, C. and J. Lafond, Human embryogenesis: overview. Methods Mol Biol, 2009. 550:
pp. 3–7.
10. Margolin, W., FtsZ and the division of prokaryotic cells and organelles. Nat Rev Mol Cell Biol, 2005. 6(11): pp. 862–871.
11. Watson, J.D. and F.H. Crick, Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Clin
Orthop Relat Res, 2007. 462: pp. 3–5.
12. Green, E.D., J.D. Watson, and F.S. Collins, Human Genome Project: Twenty-five years of big biology. Nature, 2015.
526(7571): pp. 29–31.
13. Salzberg, S.L., Open questions: How many genes do we have? BMC Biol, 2018. 16(1): p. 94.
14. Bruford, E.A., et al., Guidelines for human gene nomenclature. Nat Genet, 2020. 52(8): pp. 754–758.
15. Wojcik, F., [Guidelines for human gene nomenclature]. Ann Biol Clin (Paris), 2002. 60(3): pp. 347–350.
16. Stringer, C., The origin and evolution of Homo sapiens. Philos Trans R Soc Lond B Biol Sci, 2016. 371(1698).
17. Spieth, J. and D. Lawson, Overview of gene structure. WormBook, 2006: pp. 1–10.
18. Spieth, J., et al., Overview of gene structure in C. elegans. WormBook, 2014: pp. 1–18.
19. Haberle, V. and A. Stark, Eukaryotic core promoters and the functional basis of transcription initiation. Nat Rev Mol
Cell Biol, 2018. 19(10): pp. 621–637.
20. IUPAC-IUB Commission on Biochemical Nomenclature (CBN). Abbreviations and symbols for nucleic acids,
polynucleotides and their constituents. Recommendations 1970. Eur J Biochem, 1970. 15(2): pp. 203–208.
21. Shu, J.J., A new integrated symmetrical table for genetic codes. Biosystems, 2017. 151: pp. 21–26.
22. Pennisi, E., Genomics. ENCODE project writes eulogy for junk DNA. Science, 2012. 337(6099): pp. 1159, 1161.
23. Fan, H. and J.Y. Chu, A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics, 2007.
5(1): pp. 7–14.
24. Ponicsan, S.L., J.F. Kugel, and J.A. Goodrich, Genomic gems: SINE RNAs regulate mRNA production. Curr Opin
Genet Dev, 2010. 20(2): pp. 149–155.
25. Nelson, P.N., et al., Human endogenous retroviruses: transposable elements with potential? Clin Exp Immunol,
2004. 138(1): pp. 1–9.
26. Cordaux, R. and M.A. Batzer, The impact of retrotransposons on human genome evolution. Nat Rev Genet, 2009.
10(10): pp. 691–703.
27. Bourque, G., et al., Ten things you should know about transposable elements. Genome Biol, 2018. 19(1): p. 199.
28. Zheng, D., et al., Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolu-
tion. Genome Res, 2007. 17(6): pp. 839–851.
29. Torrents, D., et al., A genome-wide survey of human pseudogenes. Genome Res, 2003. 13(12): pp. 2559–2567.
30. Bischof, J.M., et al., Genome-wide identification of pseudogenes capable of disease-causing gene conversion. Hum
Mutat, 2006. 27(6): pp. 545–552.
31. Turner, K.J., V. Vasu, and D.K. Griffin, Telomere Biology and Human Phenotype. Cells, 2019. 8(1).
32. Aguado, J., F. d’Adda di Fagagna, and E. Wolvetang, Telomere transcription in ageing. Ageing Res Rev, 2020.
62: pp. 101–115.
33. Lodish, H., A. Berk, S.L. Zipursky, et al., Molecular Cell Biology. 4th ed. 2000. New York: Freeman.
34. Yusupov, M.M., et al., Crystal structure of the ribosome at 5.5 Å resolution. Science, 2001. 292(5518): pp. 883–896.
35. Torres-Machorro, A.L., et al., Ribosomal RNA genes in eukaryotic microorganisms: witnesses of phylogeny? FEMS
Microbiol Rev, 2010. 34(1): pp. 59–86.
36. Long, E.O. and I.B. Dawid, Repeated genes in eukaryotes. Annu Rev Biochem, 1980. 49: pp. 727–764.
37. Sollner-Webb, B. and E.B. Mougey, News from the nucleolus: rRNA gene expression. Trends Biochem Sci, 1991.
16(2): pp. 58–62.
38. Rich, A., A Hybrid Helix Containing Both Deoxyribose and Ribose Polynucleotides and Its Relation to the Transfer
of Information between the Nucleic Acids. Proc Natl Acad Sci USA, 1960. 46(8): pp. 1044–1053.
39. Bartel, D.P., Metazoan MicroRNAs. Cell, 2018. 173(1): pp. 20–51.
40. Bartel, D.P., MicroRNAs: target recognition and regulatory functions. Cell, 2009. 136(2): pp. 215–233.
41. Watson, J.D., Molecular Biology of the Gene. 7th ed. 2013. Pearson.
42. Eisenberg, E. and E.Y. Levanon, Human housekeeping genes are compact. Trends Genet, 2003. 19(7): pp. 362–365.
43. Eisenberg, E. and E.Y. Levanon, Human housekeeping genes, revisited. Trends Genet, 2013. 29(10): pp. 569–574.
44. Alberts, B., A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, Molecular Biology of the Cell. 6th ed. 2008.
New York: Garland Science.
45. Pan, Q., et al., Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput
sequencing. Nat Genet, 2008. 40(12): pp. 1413–1415.
46. Cooper, G., The Cell: A Molecular Approach. 8th ed. 2018: Oxford: Oxford University Press.
47. Nelson, D.L., Lehninger Principles of Biochemistry. 7th ed. 2017: New York: W.H. Freeman.
48. Zwanzig, R., A. Szabo, and B. Bagchi, Levinthal’s paradox. Proc Natl Acad Sci USA, 1992. 89(1):
pp. 20–22.
49. Pauling, L., R.B. Corey, and H.R. Branson, The structure of proteins; two hydrogen-bonded helical configurations of
the polypeptide chain. Proc Natl Acad Sci USA, 1951. 37(4): pp. 205–211.
50. Adams, P.D., et al., Announcing mandatory submission of PDBx/mmCIF format files for crystallographic depositions
to the Protein Data Bank (PDB). Acta Crystallogr D Struct Biol, 2019. 75(Pt 4): pp. 451–454.
51. Berman, H.M., The Protein Data Bank: a historical perspective. Acta Crystallogr A, 2008. 64(Pt 1): pp. 88–95.
52. Brown, I.D., CIF (Crystallographic Information File): A Standard for Crystallographic Data Interchange. J Res Natl
Inst Stand Technol, 1996. 101(3): pp. 341–346.
53. Siegel, R.D. and C.G. Prober, Classification of Viruses. Principles and Practice of Pediatric Infectious Disease.
2008. 1001–1005. doi: 10.1016/B978-0-7020-3468-8.50207-8. Epub 2020 June 22.
54. Dimitrov, D.S., Virus entry: molecular mechanisms and biomedical applications. Nat Rev Microbiol, 2004.
2(2): pp. 109–122.
55. Burrell, C.J., C.R. Howard, and F.A. Murphy, Fenner and White’s Medical Virology. 5th ed. 2016: Cambridge,
MA: Academic Press.
56. Yesudhas, D., A. Srivastava, and M.M. Gromiha, COVID-19 outbreak: history, mechanism, transmission, structural
studies and therapeutics. Infection, 2021. 49(2): pp. 199–213.
The Sources of Genomic Data

Krings, M. , et al., Neandertal DNA sequences and the origin of modern humans . Cell, 1997. 90 (1): pp. 19–30.
Green, R.E. , et al., A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing . Cell, 2008.
134 (3): pp. 416–426.
Prufer, K. , et al., The complete genome sequence of a Neanderthal from the Altai Mountains . Nature, 2014. 505 (7481): pp. 43–49.
Kap, M. , et al., Histological assessment of PAXgene tissue fixation and stabilization reagents . PLoS One, 2011. 6 (11): p. e27704.
Deissler, H. , et al., Rapid protein sequencing by tandem mass spectrometry and cDNA cloning of p20-CGGBP. A novel protein that
binds to the unstable triplet repeat 5’-d(CGG)n-3’ in the human FMR1 gene . J Biol Chem, 1997. 272 (27): pp. 16761–16768.
Hunt, D.F. , et al., Protein sequencing by tandem mass spectrometry . Proc Natl Acad Sci USA, 1986. 83 (17): pp. 6233–7.
Nardiello, D. , et al., Strategies in protein sequencing and characterization: multi-enzyme digestion coupled with alternate CID/ETD
tandem mass spectrometry . Anal Chim Acta, 2015. 854 : pp. 106–117.
Ziady, A.G. and M. Kinter , Protein sequencing with tandem mass spectrometry . Methods Mol Biol, 2009. 544 : pp. 325–341.
Desjardins, P . and D. Conklin , NanoDrop microvolume quantitation of nucleic acids . J Vis Exp, 2010(45).
Schroeder, A. , et al., The RIN: an RNA integrity number for assigning integrity values to RNA measurements . BMC Mol Biol, 2006. 7:
p. 3.
Smith, P.K. , et al., Measurement of protein using bicinchoninic acid . Anal Biochem, 1985. 150 (1): pp. 76–85.
Olson, B.J. and J. Markwell , Assays for determination of protein concentration . Curr Protoc Protein Sci, 2007. Chapter 3: Unit 3 4.
Engvall, E. , The ELISA, enzyme-linked immunosorbent assay . Clin Chem, 2010. 56 (2): pp. 319–320.
Temin, H.M. and S. Mizutani , RNA-dependent DNA polymerase in virions of Rous sarcoma virus. 1970 . Biotechnology, 1992. 24 :
pp. 51–56.
Kary, B., F.F. Mullis, and Richard A. Gibbs, The Polymerase Chain Reaction. 1996: Boston: Birkhäuser.
Chien, A. , D.B. Edgar , and J.M. Trela , Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus . J
Bacteriol, 1976. 127 (3): pp. 1550–1557.
Stretton, A.O. , The first sequence. Fred Sanger and insulin . Genetics, 2002. 162 (2): pp. 527–532.
Holley, R.W. , et al., Structure of a Ribonucleic Acid . Science, 1965. 147 (3664): pp. 1462–1465.
Xue, Y. , Y. Wang , and H. Shen , Ray Wu, fifth business or father of DNA sequencing? Protein Cell, 2016. 7 (7): pp. 467–470.
Maxam, A.M. and W. Gilbert , A new method for sequencing DNA . Proc Natl Acad Sci USA, 1977. 74 (2): pp. 560–564.
Sanger, F. and A.R. Coulson , A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase . J Mol
Biol, 1975. 94 (3): pp. 441–448.
Staden, R. , A strategy of DNA sequencing employing computer programs . Nucleic Acids Res, 1979. 6 (7): pp. 2601–2610.
Pareek, C.S. , R. Smoczynski and A. Tretyn , Sequencing technologies and genome sequencing . J Appl Genet, 2011. 52 (4): pp.
413–435.
Shampo, M.A. and R.A. Kyle , J. Craig Venter—The Human Genome Project . Mayo Clin Proc, 2011. 86 (4): pp. e26–27.
Ewing, B. , et al., Base-calling of automated sequencer traces using phred. I. Accuracy assessment . Genome Res, 1998. 8 (3): pp.
175–185.
Zhang, H. , Overview of Sequence Data Formats . Methods Mol Biol, 2016. 1418 : pp. 3–17.
Behjati, S. and P.S. Tarpey , What is next-generation sequencing? Arch Dis Child Educ Pract Ed, 2013. 98 (6): pp. 236–238.
Meyer, M. and M. Kircher , Illumina sequencing library preparation for highly multiplexed target capture and sequencing . Cold Spring
Harb Protoc, 2010. (6): pdb prot5448.
Mardis, E.R. , DNA sequencing technologies: 2006–2016 . Nat Protoc, 2017. 12 (2): pp. 213–218.
Ramskold, D. , et al., Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells . Nat Biotechnol,
2012. 30 (8): pp. 777–782.
Ravi, R.K. , K. Walton and M. Khosroheidari , MiSeq: A Next-generation sequencing Platform for Genomic Analysis . Methods Mol
Biol, 2018. 1706 : pp. 223–232.
Mohideen, A. , S.D. Johansen , and I. Babiak , High-Throughput Identification of Adapters in Single-Read Sequencing Data .
Biomolecules, 2020. 10 (6).
Campbell, P.J. , et al., Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-
end sequencing . Nat Genet, 2008. 40 (6): pp. 722–729.
Ewing, B. and P. Green , Base-calling of automated sequencer traces using phred. II. Error probabilities . Genome Res, 1998. 8 (3):
pp. 186–194.
Cock, P.J. , et al., The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants . Nucleic
Acids Res, 2010. 38 (6): pp. 1767–1771.
Torri, F. , et al., Next generation sequence analysis and computational genomics using graphical pipeline workflows . Genes (Basel),
2012. 3 (3): pp. 545–575.
Andrews, S ., FASTQC a quality control tool for high throughput sequence data . www.bioinformatics.babraham.ac.uk/projects/fastqc.
2019.
Li, H. and R. Durbin , Fast and accurate short read alignment with Burrows-Wheeler transform . Bioinformatics, 2009. 25 (14): pp.
1754–1760.
Langmead, B. and S.L. Salzberg , Fast gapped-read alignment with Bowtie 2 . Nat Methods, 2012. 9 (4): pp. 357–359.
Li, H. , J. Ruan and R. Durbin , Mapping short DNA sequencing reads and calling variants using mapping quality scores . Genome
Res, 2008. 18 (11): pp. 1851–1858.
Lunter, G. and M. Goodson , Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads . Genome Res,
2011. 21 (6): pp. 936–939.
Yu, X. , et al., How do alignment programs perform on sequencing data with varying qualities and from repetitive regions? BioData
Min, 2012. 5 (1): p. 6.
Pedersen, B.S. and A.R. Quinlan , Mosdepth: quick coverage calculation for genomes and exomes . Bioinformatics, 2018. 34 (5): pp.
867–868.
Sims, D. , et al., Sequencing depth and coverage: key considerations in genomic analyses . Nat Rev Genet, 2014. 15 (2): pp.
121–132.
Illumina . Coverage depth recommendations . www.illumina.com/science/technology/next-generation-sequencing/plan-
experiments/coverage.html. 2021.
Schatz, M.C. , A.L. Delcher , and S.L. Salzberg , Assembly of large genomes using second-generation sequencing . Genome Res,
2010. 20 (9): pp. 1165–1173.
Li, H. , et al., The Sequence Alignment/Map format and SAMtools . Bioinformatics, 2009. 25 (16): pp. 2078–2079.
Simpson, J.T. , et al., ABySS: a parallel assembler for short read sequence data . Genome Res, 2009. 19 (6): pp. 1117–1123.
Xie, Y. , et al., SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads . Bioinformatics, 2014. 30 (12): pp.
1660–1666.
McKenna, A. , et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data .
Genome Res, 2010. 20 (9): pp. 1297–1303.
Danecek, P. , et al., The variant call format and VCFtools . Bioinformatics, 2011. 27 (15): pp. 2156–2158.
Yang, H. and K. Wang , Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR . Nat Protoc, 2015. 10 (10):
pp. 1556–1566.
Cingolani, P. , et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the
genome of Drosophila melanogaster strain w1118; iso-2; iso-3 . Fly (Austin), 2012. 6 (2): pp. 80–92.
Adzhubei, I. , D.M. Jordan , and S.R. Sunyaev , Predicting functional effect of human missense mutations using PolyPhen-2 . Curr
Protoc Hum Genet, 2013. Chapter 7: Unit7 20.
Richards, S. , et al., Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the
American College of Medical Genetics and Genomics and the Association for Molecular Pathology . Genet Med, 2015. 17 (5): pp.
405–424.

The NCBI Entrez Databases

Bethesda (MD) : National Library of Medicine (US), N.C.f.B.I. National Center for Biotechnology Information (NCBI) [cited 2021
4/6/2021]. www.ncbi.nlm.nih.gov/. 1988.
Benson, D.A. , et al., GenBank . Nucleic Acids Res, 2018. 46 (D1): pp. D41–D47.
Tatusova, T. , et al., NCBI prokaryotic genome annotation pipeline . Nucleic Acids Res, 2016. 44 (14): pp. 6614–6624.
Murphy, M.B.G., C. Wallin, et al. Gene Help: Integrated Access to Genes of Genomes in the Reference Sequence Collection [updated
2021] [cited 2021 5/1/2021]. www.ncbi.nlm.nih.gov/books/NBK3841/. 2006.
Anders, S. , P.T. Pyl , and W. Huber , HTSeq—a Python framework to work with high-throughput sequencing data . Bioinformatics,
2015. 31 (2): pp. 166–9.
Sharma, S. , et al., The NCBI BioCollections Database . Database, 2019.
Sharma, S. , et al., The NCBI BioCollections Database. 2018. Oxford: Database (Oxford).
Figueira, R. and F. Lages , Museum and Herbarium Collections for Biodiversity Research in Angola , in Biodiversity of Angola:
Science & Conservation: A Modern Synthesis, B.J. Huntley , et al., Editors. 2019. Cham: Springer International Publishing: pp.
513–542.
Martin, N.A. , Voucher specimens: A way to protect the value of your research . Biology and Fertility of Soils, 1990. 9 (2): pp. 93–94.
Labeda, D.P. , Culture Collections: An Essential Resource for Microbiology , in Bergey’s Manual® of Systematic Bacteriology: Volume
One: The Archaea and the Deeply Branching and Phototrophic Bacteria, D.R. Boone , R.W. Castenholz , and G.M. Garrity , Editors.
2001. New York: Springer New York: pp. 111–113.
NCBI Resource Coordinators , Database resources of the National Center for Biotechnology Information . Nucleic Acids Res, 2018. 46
(D1): pp. D8–D13.
Koonin, E.V. , Orthologs, paralogs, and evolutionary genomics . Annu Rev Genet, 2005. 39 : pp. 309–338.
Klimke, W. , et al., The National Center for Biotechnology Information’s Protein Clusters Database . Nucleic Acids Res, 2009. 37
(Database issue): pp. D216–223.
Belouzard, S. , et al., Mechanisms of coronavirus cell entry mediated by the viral spike protein . Viruses, 2012. 4 (6): pp. 1011–1033.
Marchler-Bauer, A. , et al., CDD: conserved domains and protein three-dimensional structure . Nucleic Acids Res, 2013. 41 (Database
issue): pp. D348–352.
Corrales, P. , A. Vidal-Puig , and G. Medina-Gomez , PPARs and Metabolic Disorders Associated with Challenged Adipose Tissue
Plasticity . Int J Mol Sci, 2018. 19 (7).
Wu, C.H. , et al., The Protein Information Resource . Nucleic Acids Res, 2003. 31 (1): pp. 345–347.
Barker, W.C. , F. Pfeiffer and D.G. George , Superfamily classification in PIR-International Protein Sequence Database . Methods
Enzymol, 1996. 266 : pp. 59–71.
Johnson, L.S. , S.R. Eddy , and E. Portugaly , Hidden Markov model speed heuristic and iterative HMM search procedure . BMC
Bioinformatics, 2010. 11 : p. 431.
Punta, M. , et al., The Pfam protein families database . Nucleic Acids Res, 2012. 40 (Database issue): pp. D290–301.
Sigrist, C.J. , et al., PROSITE: a documented database using patterns and profiles as motif descriptors . Brief Bioinform, 2002. 3 (3):
pp. 265–274.
Sigrist, C.J. , et al., New and continuing developments at PROSITE . Nucleic Acids Res, 2013. 41 (Database issue): pp. D344–347.
Sigrist, C.J. , et al., ProRule: a new database containing functional and structural information on PROSITE profiles . Bioinformatics,
2005. 21 (21): pp. 4060–4066.
Murzin, A.G. , et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures . J Mol
Biol, 1995. 247 (4): pp. 536–540.
Huang, H. , et al., iProClass: an integrated database of protein family, function and structure information . Nucleic Acids Res, 2003. 31
(1): pp. 390–392.
Haft, D.H. , et al., RefSeq: an update on prokaryotic genome annotation and curation . Nucleic Acids Res, 2018. 46 (D1): pp.
D851–D860.
Schuster-Böckler, B. , J. Schultz and S. Rahmann , HMM Logos for visualization of protein families . BMC Bioinformatics, 2004. 5 (1):
p. 7.
Li, W. , et al., RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation . Nucleic Acids
Res, 2021. 49 (D1): pp. D1020–D1028.
Geer, L.Y. , et al., CDART: protein homology by domain architecture . Genome Res, 2002. 12 (10): pp. 1619–1623.
HomoloGene , in Encyclopedia of Genetics, Genomics, Proteomics and Informatics. 2008. Dordrecht: Springer Netherlands. pp.
899–899.
Barrett, T. , et al., BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata . Nucleic Acids Res,
2012. 40 (Database issue): pp. D57–63.
Field, D. , et al., The Genomic Standards Consortium . PLOS Biology, 2011. 9 (6): pp. e1001088.
Field, D. , et al., The minimum information about a genome sequence (MIGS) specification . Nat Biotechnol, 2008. 26 (5): pp.
541–547.
Bowers, R.M. , et al., Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome
(MIMAG) of bacteria and archaea . Nat Biotechnol, 2017. 35 (8): pp. 725–731.
Yilmaz, P. , et al., Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence
(MIxS) specifications . Nat Biotechnol, 2011. 29 (5): pp. 415–420.
Roux, S. , et al., Minimum Information about an Uncultivated Virus Genome (MIUViG) . Nature Biotechnology, 2019. 37 (1): pp.
29–37.
Sherry, S.T. , et al., dbSNP: the NCBI database of genetic variation . Nucleic Acids Res, 2001. 29 (1): pp. 308–311.
Smigielski, E.M. , et al., dbSNP: a database of single nucleotide polymorphisms . Nucleic Acids Res, 2000. 28 (1): pp. 352–355.
National Center for Biotechnology Information , N.L.o.M. dbSNP Submission Documentation Overview .
www.ncbi.nlm.nih.gov/snp/docs/submission/. 2017.
Kitts A.P.L. , M. Ward , et al., The Database of Short Genetic Variation (dbSNP). 2nd ed. The NCBI Handbook [Internet]. 2013.
Bethesda, MD: National Center for Biotechnology Information (US).
L. Phan , Y.J.H. Zhang , W. Qiang , E. Shekhtman , D. Shao , D. Revoe , R. Villamarin , E. Ivanchenko , M. Kimura , Z. Y. Wang , L.
Hao , N. Sharopova , M. Bihan , A. Sturcke , M. Lee , N. Popova , W. Wu , C. Bastiani , M. Ward , J. B. Holmes , V. Lyoshin , K. Kaur ,
E. Moyer , M. Feolo , and B. L. Kattman . ALFA: Allele Frequency Aggregator [cited 2021 5/14/2021].
www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/. 2020.
Bomba, L. , K. Walter , and N. Soranzo , The impact of rare and low-frequency genetic variants in common disease . Genome Biology,
2017. 18 (1): p. 77.
Fernandez-Moya, A. , et al., Germline Variants in Driver Genes of Breast Cancer and Their Association with Familial and Early-Onset
Breast Cancer Risk in a Chilean Population . Cancers (Basel), 2020. 12 (1).
Kap, M. , et al., Histological assessment of PAXgene tissue fixation and stabilization reagents . PLoS One, 2011. 6 (11): p. e27704.
den Dunnen, J.T. , et al., HGVS Recommendations for the Description of Sequence Variants: 2016 Update . Hum Mutat, 2016. 37 (6):
pp. 564–569.
Auton, A. , et al., A global reference for human genetic variation . Nature, 2015. 526 (7571): pp. 68–74.
Karczewski, K.J. , et al., The mutational constraint spectrum quantified from variation in 141,456 humans . Nature, 2020. 581 (7809):
pp. 434–443.
Karczewski, K.J. , et al., The ExAC browser: displaying reference data information from over 60 000 exomes . Nucleic Acids Res,
2017. 45 (D1): pp. D840–D845.
Taliun, D. , et al., Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program . Nature, 2021. 590 (7845): pp.
290–299.
Lappalainen, I. , et al., DbVar and DGVa: public archives for genomic structural variation . Nucleic Acids Res, 2013. 41 (Database
issue): pp. D936–941.
Kusenda, M. and J. Sebat , The role of rare structural variants in the genetics of autism spectrum disorders . Cytogenet Genome Res,
2008. 123 (1–4): pp. 36–43.
Landrum, M.J. , et al., ClinVar: improvements to accessing data . Nucleic Acids Res, 2020. 48 (D1): pp. D835–D844.
Brookes, A.J. and P.N. Robinson , Human genotype–phenotype databases: aims, challenges and opportunities . Nature Reviews
Genetics, 2015. 16 (12): pp. 702–715.
Holmes, J.B. , et al., SPDI: data model for variants and applications at NCBI . Bioinformatics, 2020. 36 (6): pp. 1902–1907.
Landrum, M.J. , et al., ClinVar: improving access to variant interpretations and supporting evidence . Nucleic Acids Res, 2018. 46
(D1): pp. D1062–D1067.
Tarca, A.L. , R. Romero , and S. Draghici , Analysis of microarray experiments of gene expression profiling . Am J Obstet Gynecol,
2006. 195 (2): pp. 373–388.
Wong, M.L. and J.F. Medrano , Real-time PCR for mRNA quantitation . BioTechniques, 2005. 39 (1): pp. 75–85.
Teng, M. , et al., A benchmark for RNA-seq quantification pipelines . Genome Biology, 2016. 17 (1): p. 74.
Edgar, R. , M. Domrachev and A.E. Lash , Gene Expression Omnibus: NCBI gene expression and hybridization array data repository .
Nucleic Acids Res, 2002. 30 (1): pp. 207–210.
Barrett, T. , et al., NCBI GEO: archive for functional genomics data sets—update . Nucleic Acids Research, 2012. 41 (D1): pp.
D991–D995.
Reinartz, J. , et al., Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all
organisms . Briefings in Functional Genomics, 2002. 1 (1): pp. 95–104.
Alfred, J. , Golden path to genome . Nature Reviews Genetics, 2000. 1 (2): p. 87.
Simao, F.A. , et al., BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs . Bioinformatics,
2015. 31 (19): pp. 3210–3212.
Genome Reference Consortium, GRC and Collaborators. www.ncbi.nlm.nih.gov/grc/credits/. 2021.
Regular Expressions Quick Start [cited 2021 5/23/2021]. www.regular-expressions.info/quickstart.html. 2020.
Birney, E. and N. Soranzo , The end of the start for population sequencing . Nature, 2015. 526 (7571): pp. 52–53.
Fairley, S. , et al., The International Genome Sample Resource (IGSR) collection of open human genomic variation resources .
Nucleic Acids Research, 2019. 48 (D1): pp. D941–D947.
Online Mendelian Inheritance in Man, OMIM® [cited 2021 5/26/2021]. https://omim.org/. 2021.
Amberger, J.S. , et al., OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic
disorders . Nucleic Acids Res, 2015. 43 (Database issue): pp. D789–798.
Database resources of the National Center for Biotechnology Information . Nucleic Acids Res, 2013. 41 (Database issue): pp.
D8–D20.
Larkin, M.A. , et al., Clustal W and Clustal X version 2.0 . Bioinformatics, 2007. 23 (21): pp. 2947–2948.
Edgar, R.C. , MUSCLE: multiple sequence alignment with high accuracy and high-throughput . Nucleic Acids Res, 2004. 32 (5): pp.
1792–1797.
Di Tommaso, P. , et al., T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural
information and homology extension . Nucleic Acids Res, 2011. 39 (Web Server issue): pp. W13–W17.
Nelson, D.L. and M.M. Cox, Lehninger Principles of Biochemistry. 5th ed. 2008. New York: W.H. Freeman.
Tardito, D. , et al., Signaling pathways regulating gene expression, neuroplasticity, and neurotrophic mechanisms in the action of
antidepressants: a critical overview . Pharmacol Rev, 2006. 58 (1): pp. 115–134.
Ochsner, S.A. , et al., The Signaling Pathways Project, an integrated ‘omics knowledgebase for mammalian cellular signaling
pathways . Scientific Data, 2019. 6 (1): p. 252.
Kanehisa, M. , et al., KEGG for linking genomes to life and the environment . Nucleic Acids Res, 2008. 36 (Database issue): pp.
D480–D484.
BioCarta . Biotech Software & Internet Report, 2001. 2 (3): pp. 117–120.
Karp, P.D. , et al., Expansion of the BioCyc collection of pathway/genome databases to 160 genomes . Nucleic Acids Res, 2005. 33
(19): pp. 6083–6089.
Thomas, P.D. , et al., PANTHER: a browsable database of gene products organized by biological function, using curated protein
family and subfamily classification . Nucleic Acids Res, 2003. 31 (1): pp. 334–341.
Schaefer, C.F. , et al., PID: the Pathway Interaction Database . Nucleic Acids Res, 2009. 37 (Database issue): pp. D674–D679.
Matthews, L. , et al., Reactome knowledgebase of human biological pathways and processes . Nucleic Acids Res, 2009. 37
(Database issue): pp. D619–D622.
Pico, A.R. , et al., WikiPathways: pathway editing for the people . PLoS Biol, 2008. 6 (7): p. e184.
Gene Ontology Consortium , The Gene Ontology project in 2008 . Nucleic Acids Res, 2008. 36 (Database issue): pp. D440–D444.
Geer, L.Y. , et al., The NCBI BioSystems database . Nucleic Acids Res, 2010. 38 (Database issue): pp. D492–D496.
NCBI BioSystems Database [cited 2021 5/29/2021]. www.ncbi.nlm.nih.gov/Structure/biosystems/docs/biosystems_about.html.
Tryka, K.A. , et al., NCBI’s Database of Genotypes and Phenotypes: dbGaP . Nucleic Acids Res, 2014. 42 (Database issue): pp.
D975–D979.
Mailman, M.D. , et al., The NCBI dbGaP database of genotypes and phenotypes . Nat Genet, 2007. 39 (10): pp. 1181–1186.
Sweetlove, L. , Number of species on Earth tagged at 8.7 million . Nature, 2011.
The Levels of Classification. 2021.
Schoch, C.L. , et al., NCBI Taxonomy: a comprehensive update on curation, resources and tools. Oxford: Database (Oxford), 2020.
NLM . MEDLINE: Overview [cited 2021 4/16/2021]. www.nlm.nih.gov/medline/medline_overview.html. 2021.

NCBI Entrez E-Utilities and Applications

NCBI Resource Coordinators, Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 2018. 46
(D1): pp. D8–D13.
Sayers, E. , A General Introduction to the E-utilities . Entrez Programming Utilities Help [cited 2021 6/6/2021].
www.ncbi.nlm.nih.gov/books/NBK25497/. 2010.
Sayers, E., The E-utilities In-Depth: Parameters, Syntax and More. Entrez Programming Utilities Help 2009 5/15/2021 [cited 2021
6/25/2021]. https://www.ncbi.nlm.nih.gov/books/NBK25499/.

The Entrez Direct

Sayers, E. , A General Introduction to the E-utilities . Entrez Programming Utilities Help [cited 6/6/2021].
www.ncbi.nlm.nih.gov/books/NBK25497/. 2010.
Sayers, E. , The E-utilities In-Depth: Parameters, Syntax and More . Entrez Programming Utilities Help [cited 6/25/2021].
www.ncbi.nlm.nih.gov/books/NBK25499/. 2009.
Kans, J., Entrez Direct: E-utilities on the Unix Command Line, in Entrez Programming Utilities Help [Internet] [cited 6/25/2021].
www.ncbi.nlm.nih.gov/books/NBK179288/. 2013, updated 2021.

R and Python Packages for the NCBI E-Utilities

Sayers, E. , The E-utilities In-Depth: Parameters, Syntax and More . Entrez Programming Utilities Help [cited 6/25/2021].
www.ncbi.nlm.nih.gov/books/NBK25499/. 2009.
Winter, D.J. , rentrez: An R package for the NCBI eUtils API . R Journal, 2017. 9 : pp. 520–526.
Winter, D., Rentrez Tutorial [cited 2021 6/29/2021]. https://cran.r-project.org/web/packages/rentrez/vignettes/rentrez_tutorial.html.
2020.
Cock, P.J. , et al., Biopython: freely available Python tools for computational molecular biology and bioinformatics . Bioinformatics,
2009. 25 (11): pp. 1422–1423.

Pairwise Sequence Alignment

Levenshtein, V., Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1965. 10: pp.
707–710.
Sackton, T.B. and N. Clark , Convergent evolution in the genomics era: new insights and directions . Philos Trans R Soc Lond B Biol
Sci, 2019. 374 (1777): p. 20190102.
Dayhoff, M.O., R.M. Schwartz, and B.C. Orcutt, A Model of Evolutionary Change in Proteins, in Atlas of Protein Sequence and
Structure, M.O. Dayhoff, Editor. 1978. Washington, DC: National Biomedical Research Foundation.
Henikoff, S. and J.G. Henikoff , Amino acid substitution matrices from protein blocks . Proc Natl Acad Sci USA, 1992. 89 (22): pp.
10915–10919.
Needleman, S.B. and C.D. Wunsch , A general method applicable to the search for similarities in the amino acid sequence of two
proteins . J Mol Biol, 1970. 48 (3): pp. 443–453.
Smith, T.F. and M.S. Waterman , Identification of common molecular subsequences . J Mol Biol, 1981. 147 (1): pp. 195–197.
Altschul, S.F. , et al., Basic local alignment search tool. J Mol Biol, 1990. 215 (3): pp. 403–410.
Wootton, J.C. and S. Federhen , Statistics of local complexity in amino acid sequences and sequence databases . Computers &
Chemistry, 1993. 17 (2): pp. 149–163.
Schmitt, A.O. and H. Herzel , Estimating the Entropy of DNA Sequences . Journal of Theoretical Biology, 1997. 188 (3): pp. 369–377.
Altschul, S.F. , et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs . Nucleic Acids Res,
1997. 25 (17): pp. 3389–3402.

Basic Local Alignment Search Tool

Altschul, S.F. , et al., Basic local alignment search tool . J Mol Biol, 1990. 215 (3): pp. 403–410.
Woese, C.R. and G.E. Fox , Phylogenetic structure of the prokaryotic domain: the primary kingdoms . Proc Natl Acad Sci USA, 1977.
74 (11): pp. 5088–5090.
Garrity, G.M. , A New Genomics-Driven Taxonomy of Bacteria and Archaea: Are We There Yet? J Clin Microbiol, 2016. 54 (8): pp.
1956–1963.
Tao, T. Standalone BLAST Setup for Windows PC. BLAST® Help [cited 2021]. www.ncbi.nlm.nih.gov/books/NBK52637/. 2010.

Chapter 1 | The Origin of Genomic Information

Chapter 2 | The Sources of Genomic Data

Chapter 3 | The NCBI Entrez Databases

Chapter 4 | NCBI Entrez E-Utilities and Applications

Chapter 5 | The Entrez Direct

Chapter 6 | R and Python Packages for the NCBI E-Utilities

Chapter 7 | Pairwise Sequence Alignment

Chapter 8 | Basic Local Alignment Search Tool

The Origin of Genomic Information

GENETIC INFORMATION AND ITS TRANSMISSION

TABLE 1.1 Number of Chromosomes in the Somatic Cells of Some Organisms

FIGURE 1.2 (a) A bacterial chromosome (b) eukaryotic chromosome.

FIGURE 1.3 Human chromosome 11 showing banding patterns.

FIGURE 1.5 Fertilization in plant and animal.

FIGURE 1.6 Embryonic and fetal development.

FIGURE 1.7 Bacterial cell division.

STRUCTURE OF DNA AND GENOME

FIGURE 1.8 (a) Phosphate group and (b) deoxyribose sugar.

FIGURE 1.9 Formation of phosphodiester bond between deoxyribose sugars.

FIGURE 1.10 Purines adenine (left) and guanine (right).

FIGURE 1.11 Pyrimidines cytosine (left) and thymine (right).

FIGURE 1.12 Double helix of DNA.

The double-stranded DNA can be represented as

Homo sapiens 2893.5 1.2% 19,593 121,425

FIGURE 1.13 Gene structure.

Gene Regulatory Region

FIGURE 1.14 Enhancer.

FIGURE 1.15 Promoter.

TABLE 1.3 IUPAC Degenerate Nucleotide Symbols [20]

First Position T C A G Third Position

T Phe Ser Tyr Cys T

The Non-Coding Genomic Sequences

AGGCT AGGCT AGGCT AGGCT AGGCT AGGCT

FIGURE 1.16 Formation of pseudogene by reverse transcription.

FIGURE 1.17 Ribose and deoxyribose.

FIGURE 1.18 Uracil.

FIGURE 1.19 RNA types.

FIGURE 1.20 Ribosome.

FIGURE 1.21 Transfer RNA.

Messenger RNA Transcription

FIGURE 1.22 Formation of microRNA.

FIGURE 1.23 Transcription initiation.

FIGURE 1.24 Transcription elongation stage.

FIGURE 1.25 Transcription termination.

FIGURE 1.26 Gene alternative splicing.

FIGURE 1.27 Pre-mRNA modifications.

TABLE 1.5 Genetic Code [21]

First Position U C A G Third Position

U Phe Ser Tyr Cys U

FIGURE 1.28 Possible open reading frames (ORFs)

Messenger RNA Translation

FIGURE 1.29 Translation in ribosome.

FIGURE 1.30 General chemical composition of amino acids.

FIGURE 1.31 Formation of peptide bond between amino acids.

FIGURE 1.32 Non-polar side chains.

FIGURE 1.33 Amino acids with polar uncharged side chains.

FIGURE 1.34 Amino acids with electrically charged side chains.

FIGURE 1.35 Disulfide bridges between two cysteine amino acids.

TABLE 1.5 Amino Acids and Their Symbols

Three-Dimensional Protein Structure

FIGURE 1.36 Primary structure of protein.

FIGURE 1.38 Tertiary structure of protein.

FIGURE 1.39 Quaternary structure of protein.

FIGURE 1.41 Key-value style of PDBx/mmCIF format.

FIGURE 1.42 Tabular style of PDBx/mmCIF format.