Hamid Ismail
Michigan Technological University
All content following this page was uploaded by Hamid Ismail on 18 January 2022.
Virus Bioinformatics
Dmitrij Frishman, Manuela Marz
Multivariate Data Integration Using R: Methods and Applications with the mixOmics Package
Kim-Anh LeCao, Zoe Marie Welham
Hamid D. Ismail
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2022 Hamid D. Ismail
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been
acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under US Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact
mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation
without intent to infringe.
Library of Congress Cataloging‑in‑Publication Data
Names: Hamid, Ismail D., author. | National Center for Biotechnology Information (U.S.)
Title: Bioinformatics : a practical guide to NCBI databases and sequence alignments / Ismail D. Hamid.
Description: First edition. | Boca Raton : CRC Press, 2022. |
Series: Chapman & Hall/CRC computational biology series |
Includes bibliographical references and index.
Identifiers: LCCN 2021036517 | ISBN 9781032123691 (hardback) |
ISBN 9781032128740 (paperback) | ISBN 9781003226611 (ebook)
Subjects: LCSH: Bioinformatics.
Classification: LCC QH324.2 .H35 2022 | DDC 570.285–dc23
LC record available at https://lccn.loc.gov/2021036517
ISBN: 978-1-032-12369-1 (hbk)
ISBN: 978-1-032-12874-0 (pbk)
ISBN: 978-1-003-22661-1 (ebk)
DOI: 10.1201/9781003226611
Typeset in Minion
by Newgen Publishing UK
Access the companion website: http://hamiddi.com/
Contents
Preface
Acknowledgments
Author bio
Index
Preface
It has been a while since I began thinking of writing a bioinformatics book that introduces both theory and practice. I have been practicing bioinformatics for a decade and, during this time, I have read dozens of well-written bioinformatics books with high-quality content. However, like many of my colleagues and students, I struggled to find a comprehensive, self-contained book that combines theory and application. Bioinformatics is a field of applied science in which theory and practice should coexist. An ideal bioinformatics book should therefore provide a good introduction, be clearly written with informative illustrations, and offer sufficient depth and breadth to serve as a valuable reference for students and researchers. Thus, from the beginning I set a clear goal: to write a book that blends both components so that it can serve as a quick reference for biologists.
The National Center for Biotechnology Information (NCBI) is a resource for molecular biology information worldwide. Its mission is to develop new information technologies to aid in the understanding of the fundamental molecular and genetic processes that control health and disease. To achieve this mission, the NCBI has developed a variety of automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics. As a result, the NCBI’s databases are among the most comprehensive resources of biological data: they integrate most of the major databases worldwide, and they are among the biological databases most commonly used by biologists around the globe.
Biological databases play a key role in the bioinformatics that modern biology relies on. They offer scientists the opportunity to access a wide variety of experimentally derived biological data and relevant information that can be used in genomic research and knowledge discovery across a wide range of applications in bioscience and biomedicine.
This book provides the basics of bioinformatics and in-depth coverage of the NCBI databases, sequence alignment, and the NCBI Basic Local Alignment Search Tool (BLAST). As bioinformatics has become essential for the life sciences, the book has been written to address the needs of a large audience, including undergraduates, graduates, researchers, healthcare professionals, and bioinformatics professors who need to use the NCBI databases, retrieve data from them, and use BLAST for tasks such as finding evolutionarily related sequences, annotating sequences, constructing phylogenetic trees, and finding conserved domains in proteins. Technical details of alignment algorithms are explained with minimal use of mathematical formulas and with graphical illustrations. What distinguishes this book is that it presents the most widely used knowledge of bioinformatics databases and alignments, covering both theory and application, through illustrations and worked examples. The book also discusses the use of the Windows Command Prompt, the Linux shell, R, and Python for working with both the Entrez databases and BLAST.
The most important aspects of bioinformatics included in this book, and in-depth and up-to-date coverage
of the NCBI databases and BLAST, make this book an ideal textbook for all bioinformatics courses taken by
students of life sciences and for researchers wishing to develop their knowledge of bioinformatics to facilitate
their own research. The chapters are organized to be progressive from basic to more complex but they are also
linked and self-contained so that readers will find all the information they need to understand the content.
Chapter 1 discusses the basics of bioinformatics, including eukaryotic, prokaryotic, and viral genomes; the composition and structure of DNA, RNA, genes, and proteins; and the processes involved in biomolecule formation and structure, such as transcription, translation, and mutation. The purpose of this introductory chapter is to serve as a foundation for bioinformatics in general, but it also aims to provide readers with the knowledge to understand the origin of the genomic data that are the major elements of the biological databases.
Chapter 2 continues to build the background knowledge needed to understand how genomic data are generated from biological samples and the file formats used in bioinformatics to store sequence data in genomics databases. It discusses the laboratory techniques that generate the experimentally derived genomic data, which are the basic elements of genomics and genomics databases. The techniques discussed include polymerase chain reaction (PCR), first-generation sequencing, and next-generation sequencing. The file formats covered include FASTA, FASTQ, SAM/BAM, VCF, and AGP.
Chapter 3 covers the NCBI’s Entrez databases in detail. The discussion includes the use of each database, its indexed fields, the construction of search queries, the results pages, the content of record pages, and linked tools. This chapter aims to be practical, and readers can practice while reading to get the full benefit. The use of each database is illustrated with screenshots of the results pages and individual record pages, accompanied by a discussion of their content. This chapter also serves as an introduction to Chapters 4, 5, and 6.
Chapter 4 discusses the Entrez Programming Utilities (E-utilities, or E-utils), which are application programming interfaces (APIs) to the NCBI Entrez databases that enable access to the databases from computer programming languages. This chapter aims to provide readers with basic E-utilities programs so that programmers can use them and regular users can understand how the E-utilities are applied from other programs.
Chapter 5 covers Entrez Direct (EDirect), which is a set of command-line programs that wrap the E-utils functions. Readers will learn how to install and use EDirect in Windows, Linux, and macOS environments. The chapter also discusses how to use Linux bash shell commands and scripting to extend the use of EDirect functions for data retrieval and extraction.
Chapter 6 discusses the use of R packages (R Entrez) and Python (BioPython) for searching and retrieving data from the NCBI databases. Both R and Python are free, open source, and among the most widely used scripting languages in the classroom and in the life sciences. Accessing the Entrez databases from within these two programming environments will help in developing desktop and web applications and in advanced research work that requires the integration of different resources.
Chapter 7 discusses the basics and algorithms of the global and local alignments. It walks the reader through
the steps of the dynamic programming for the sequence alignment. It also illustrates the algorithms of the
regular BLAST and PSI-BLAST and demonstrates the steps of computing the position-specific scoring matrix
(PSSM), which is used for motif discovery and detection of conserved regions in protein sequences. The aim of
this chapter is to provide readers with the background knowledge on sequence scoring schemes, substitution
matrices, and the sequence alignment algorithms.
Chapter 8 covers the uses of the BLAST programs. In the first part, the chapter discusses the uses of web-based BLAST with worked examples of the most frequent scenarios, such as finding related sequences, constructing phylogenetic trees, finding conserved domains and structures in proteins, and annotating sequences. The second part of this chapter focuses on stand-alone BLAST, which is essential for users who run searches exceeding the limits allowed by the NCBI. The chapter illustrates how to install and use stand-alone BLAST and how to create a custom BLAST database.
Acknowledgments
I would like to thank Dr Amani Babekir for her help in reviewing the chapters of this book several times and for providing useful comments that added to the quality of the content. I would also like to thank Amna Ismail, who helped me in drawing the figures of this book and designing the cover image. My thanks also extend to both Dr Cory Brouwer and Dr Roshonda Jones for reviewing the book proposal and for their encouraging feedback.
Author bio
Hamid D. Ismail earned his Ph.D. and M.Sc. in computational science from North Carolina Agricultural and
Technical State University, USA, and Doctor of Veterinary Medicine (DVM) and Applied Diploma in Statistics
and Programming from the University of Khartoum, Sudan. He has worked as an adjunct professor, bioinfor-
matics specialist, bioinformatics tool developer, senior scientist, statistician, and database consultant with sev-
eral institutions. He is currently a Ph.D. Research Associate at North Carolina Agricultural and Technical State
University. Dr. Ismail is a published author in bioinformatics, statistics, machine learning, and bio-science,
and he has developed a number of bioinformatics desktop and web-based applications. He is also a reviewer
for a number of academic journals.
CHAPTER 1
INTRODUCTION
The diversity of life, from simple organisms like bacteria to the largest animals, and the diversity of individuals within a species are guided by a biomolecule inside living cells called deoxyribonucleic acid (DNA). The DNA molecule is built from only four basic monomeric units known as DNA nucleotides, each composed of a phosphate group, a sugar, and one of four types of nucleobases, or simply bases (adenine, cytosine, guanine, and thymine). In bioinformatics, these four units are given the letters A, C, G, and T, respectively. The DNA molecules in a living cell are represented as sequences of those four nucleotides, which together form the genome. Viruses usually have small genomes; Bacteriophage spp has a median total length of 8689 bases (8.689 kb). The smallest non-viral genome is that of a bacterium known as Carsonella ruddii, which has a genome of 164,376 bases (164.376 kb). The total length of the human genome is 3,272,090,000 bases (3,272.09 Mb).
Segments of DNA known as genes control the different aspects of the life of an organism by instructing the cells to synthesize proteins, which do most of the work in cells and are required for the structure, function, and regulation of the body tissues and organs. The instructions are transcribed into ribonucleic acid (RNA), which is translated into a specific protein. The two-step process (transcription and translation) by which the information in a gene flows into proteins is known as the central dogma of molecular biology. The information in the DNA is also transmitted from one generation to the next: a new generation of a living organism inherits its characteristics through DNA transmitted from its parents. The diversity of life is attributed to the ability of DNA to change slowly over time. Such changes, or mutations, produce new traits, some of which help organisms adapt to changes in nature.
Advances in molecular biology and biotechnology have made it possible to capture the information carried by DNA, RNA, and proteins. Sequences and other biological information from diverse species, and from individuals within species, are now increasingly deposited by researchers and institutions in bioinformatics databases, where they are available for retrieval and analysis for research purposes. Genomic information has revolutionized biology and made modern biologists dependent on bioinformatics, which uses computer science to store, organize, search, manipulate, and retrieve genomic information. Institutions such as the National Institutes of Health (NIH), the European Molecular Biology Laboratory (EMBL), and the National Institute of Genetics (NIG) in Japan have contributed greatly to the progress made in bioinformatics. Together, these three institutions formed the International Nucleotide Sequence Database Collaboration (INSDC) [1], which is a joint effort to collect and disseminate databases containing DNA, RNA, and protein sequences. The INSDC includes GenBank (USA), the European Nucleotide Archive (UK), and the DNA Data Bank of Japan (Japan). These three partners capture, preserve, share, and exchange a comprehensive collection of nucleotide sequences and associated information on a daily basis. The INSDC policy allows public access to the global archives of nucleotide data generated in publicly funded experiments. Submission of genomic data is encouraged by the fact that it is a prerequisite for publication in scholarly journals. The database records are publicly available for scientists from all over the world to access, analyze, draw conclusions from, and use in publishing their findings.
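The genome lengths quoted above are given both as raw base counts and in kilobases (kb) or megabases (Mb). As a minimal sketch of that arithmetic, the Python snippet below converts the three base counts from the text. The function name `format_genome_size` is a hypothetical helper introduced here for illustration only; it is not part of any NCBI tool.

```python
def format_genome_size(bases: int) -> str:
    """Express a genome length in kb or Mb (1 kb = 1,000 bases; 1 Mb = 1,000,000 bases)."""
    if bases >= 1_000_000:
        return f"{bases / 1_000_000:,.2f} Mb"
    return f"{bases / 1_000:,.3f} kb"


# Base counts quoted in the text above.
genomes = {
    "Bacteriophage spp (median)": 8_689,
    "Carsonella ruddii": 164_376,
    "Human": 3_272_090_000,
}

for name, bases in genomes.items():
    print(f"{name}: {bases:,} bases = {format_genome_size(bases)}")
```

The choice of three decimals for kb and two for Mb simply mirrors the precision used in the text.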
DOI: 10.1201/9781003226611-1
2 | The Origin of Genomic Information
Before digging deeper, it is important to discuss some basics of genomics that will help readers understand bioinformatics. The foundation of bioinformatics is built on data that represent the flow of genomic information from DNA to RNA and then to proteins. Therefore, understanding the composition of these three kinds of biomolecules, gene structure, gene transcription and expression, mutation, and the techniques used to obtain such genomic data is fundamental to understanding the biological databases and other bioinformatics applications.
FIGURE 1.1 (a) Bacterial (prokaryotic) cell and (b) eukaryotic cell.
Organism      Number of chromosome pairs
Human         23
Chimpanzee    24
Cow           30
Sheep         27
Goat          30
Horse         32
Dog           39
Cat           19
Rice          12
Bean          11
Tobacco       24
a human somatic cell has 23 pairs of chromosomes, of which 22 pairs are autosomes, and one pair is the
sex chromosome (X and Y). Each eukaryotic chromosome is made up of DNA tightly coiled many times
around histones and it has a constriction point called the centromere, which divides the chromosome into two
sections called arms. By convention, the shorter arm of the chromosome is known as the p-arm and the longer
arm of the chromosome is known as the q-arm. The location of the centromere on each chromosome gives
the chromosome its characteristic shape and can be used to describe the location of specific genes. The ends of
chromosomes are protected by telomeres.
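The values tabulated above are numbers of chromosome pairs, so the diploid chromosome number of a somatic cell is twice the tabulated value. A small Python sketch of that relationship follows; the dictionary name `CHROMOSOME_PAIRS` and the helper `diploid_number` are illustrative names chosen here, not part of any library.

```python
# Chromosome pairs per organism, copied from the table above.
CHROMOSOME_PAIRS = {
    "Human": 23, "Chimpanzee": 24, "Cow": 30, "Sheep": 27,
    "Goat": 30, "Horse": 32, "Dog": 39, "Cat": 19,
    "Rice": 12, "Bean": 11, "Tobacco": 24,
}


def diploid_number(organism: str) -> int:
    """Total chromosomes in a somatic (diploid) cell: two chromosomes per pair."""
    return 2 * CHROMOSOME_PAIRS[organism]


for organism in ("Human", "Dog", "Rice"):
    print(f"{organism}: {CHROMOSOME_PAIRS[organism]} pairs, "
          f"{diploid_number(organism)} chromosomes per somatic cell")
```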
The DNA molecules in chromosomes wrap around complexes of histone proteins giving the chromosomes
compact compressed shapes so that the DNA molecules can fit in the nucleus. Without such packaging, DNA
molecules would be too long to fit inside cells.
Several imaging techniques can be used to study the structure of chromosomes [5]. These techniques include light microscopy, fluorescence microscopy, electron microscopy, and coherent X-ray diffraction imaging. Staining is usually used to enhance the contrast between cellular components, allowing chromosomes to be visualized properly with the different imaging techniques. Differential staining along the length of a chromosome produces clearly visible bands (cytogenetic bands), which provide information about the chromosome. Each chromosome has its own unique banding pattern. The bands can therefore be used to identify individual chromosomes, to specify the cytogenetic location of a gene, and to study chromosomal abnormalities caused by deletions, insertions, or translocations.
The cytogenetic location consists of a chromosome number (in the human, 1-22, X, or Y), the arm of the chromosome (p or q), and the position of the gene on that arm. The position is usually designated by two digits representing a region and a band. The two digits are sometimes followed by a decimal point and one or more additional digits representing sub-bands within a light or dark area. For example, the cytogenetic location of the BRCA1 gene in the human is 17q21.31, which gives the address of BRCA1 as region 2, band 1, sub-band 31 on the long arm (q) of chromosome 17. Chromosomes are represented graphically by chromosome ideograms, which show the relative sizes of the chromosomes of an organism and their characteristic banding patterns generated by staining (Figure 1.3).
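The notation described above can be split mechanically into its parts. The following Python sketch parses simple human cytogenetic locations such as 17q21.31. The regular expression is an assumed, simplified grammar written for this example (it does not cover ranges, centromere or telomere designations, or other extended notation), and the names `CytogeneticLocation` and `parse_location` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CytogeneticLocation(NamedTuple):
    chromosome: str          # "1".."22", "X", or "Y" for the human
    arm: str                 # "p" (short arm) or "q" (long arm)
    region: int              # first digit after the arm
    band: int                # second digit after the arm
    sub_band: Optional[str]  # digits after the decimal point, if any


# Simplified pattern: chromosome, arm, region digit, band digit, optional sub-band.
_LOCATION = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")


def parse_location(text: str) -> CytogeneticLocation:
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple cytogenetic location: {text!r}")
    chromosome, arm, region, band, sub_band = match.groups()
    return CytogeneticLocation(chromosome, arm, int(region), int(band), sub_band)


print(parse_location("17q21.31"))
```

The sub-band is kept as a string so that leading zeros and the number of digits after the decimal point are preserved.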
In most complex organisms, such as humans, one chromosome of each pair is inherited from the mother (through the ovum) and the other from the father (through the sperm). The offspring therefore inherit some of their traits from their mother and others from their father. The pattern of inheritance is different for the circular mitochondrial DNA. Only ova keep their mitochondria during fertilization. Therefore, mitochondrial DNA is always inherited from the female parent.
In complex animals such as humans, the ova in the female and the sperm in the male are formed in specialized organs (ovaries in females and testes in males) by a type of cell division called meiosis, which produces four genetically different daughter cells from a single cell (Figure 1.4) [6]. Each daughter cell is haploid; that is, it contains only one chromosome of each chromosome pair. In general, all male and female gametes, which are the products of meiotic cell division, are haploid.
Fertilization occurs when the nucleus of a male gamete (sperm) combines with the nucleus of a female
gamete (ovum). In flowering plants, the male gamete is a cell in the pollen grain and the female gamete is
an egg cell in the ovule. When the male and female gametes combine, the resulting cell is called a zygote
(Figure 1.5) [7, 8].
FIGURE 1.4 Cell division forming male sperms (top) and female egg (bottom).
Embryonic development, also known as embryogenesis, refers to the steps of development and formation of the embryo from a fertilized ovum to a mature embryo (Figure 1.6). Embryogenesis is characterized by mitotic cell division (after fertilization) and cellular differentiation of the embryo, which occur during the early stages of development. Mitosis is a type of cell division that results in two daughter cells and supports tissue growth. Each daughter cell has the same number and kind of chromosomes as the parent cell [9].
After fertilization, the genetic material of the haploid sperm and the haploid ovum combine to form a single diploid cell called a zygote, and the germinal stage of development begins. In humans, embryonic development covers the first eight weeks after fertilization. From the ninth week onward, the embryo is referred to as a fetus. The gestation, or pregnancy, period in humans is around 40 weeks (nine months).
In the case of bacterial cells, reproduction takes place using binary fission, which is basically an asexual cell
division where a single parent bacterial cell replicates its circular DNA and plasmids before splitting into two
new identical daughter cells (Figure 1.7) [10].
Deoxyribonucleic acid, or DNA, was first observed by the Swiss biochemist Friedrich Miescher in the late 1800s. The structure of DNA was resolved by James Watson and Francis Crick in 1953 [11]. DNA is made up of four types of molecules called nucleotides. Each nucleotide contains a phosphate group, a sugar group, and a nitrogenous base. The phosphate group and the sugar group, called deoxyribose, are the same in all four nucleotides (Figure 1.8). However, each nucleotide has a different nitrogenous base. The four nitrogenous bases that distinguish the four nucleotides are adenine (A), thymine (T), guanine (G), and cytosine (C).
DNA consists of two strands that wind around each other like a twisted ladder. Each strand has a backbone made of alternating deoxyribose sugars and phosphate groups joined by phosphodiester bonds.
The phosphodiester bond is the linkage between the 3′ carbon atom of the deoxyribose of one nucleotide and the 5′ carbon atom of the deoxyribose of the next nucleotide (Figure 1.9).
The four nitrogenous bases give the four nucleotides their characteristics. Adenine (A) and guanine (G) are purine bases, whose structure consists of a five-membered ring fused to a six-membered ring (Figure 1.10).
Cytosine (C) and thymine (T) are pyrimidines, whose structure consists of a single six-membered ring (Figure 1.11).
In DNA, each nitrogenous base (A, C, G, or T) is attached to the deoxyribose of its nucleotide by a glycosidic bond between a ring nitrogen of the base (N1 in pyrimidines, N9 in purines) and the 1′ carbon of the sugar. The two DNA strands are held together by hydrogen bonds between the nitrogenous bases, where adenine (A) always pairs with thymine (T), and cytosine (C) always pairs with guanine (G), forming the double helix structure of DNA (Figure 1.12). This relationship between the bases of each pair (A with T, and C with G) is known as complementary base pairing.
The order of the nitrogenous bases in the genome of a specific organism determines DNA instructions, or
genetic codes, for building and maintaining the organism. The determined order of the four bases that make
up the genome or a segment of the genome is called a DNA sequence.
The nucleotides in a DNA sequence are arranged in the direction from the 5-prime (5′) end, which has a free hydroxyl (or phosphate) on a 5′ carbon, to the 3-prime (3′) end, which has a free hydroxyl (or phosphate) on a 3′ carbon. The DNA strand written 5′ → 3′ is known as the plus “+” DNA strand, and its complementary strand is known as the minus “-” DNA strand.
In bioinformatics, the nucleotide sequence of a piece of DNA is represented using the one-letter abbreviations of the four bases A, T, C, and G to identify the order of the nucleotides in the DNA molecule or a fragment of the DNA molecule. The DNA representation is usually for the plus DNA strand (from the 5′ end to the 3′ end). The complementary strand of a DNA sequence can be easily predicted from the other strand. For example, assume that the following is a representation of a DNA sequence:
CTCACTGA
Each letter represents a nucleotide base (A for adenine, C for cytosine, G for guanine, and T for thymine). Because A pairs with T and C pairs with G, the complementary strand of the above DNA sequence, written base by base beneath it (that is, in the 3′ → 5′ direction), will be:
GAGTGACT
5′-CTCACTGA-3′
   ||||||||
3′-GAGTGACT-5′
Such representation of DNA sequence is the basis for the genomic data that the field of bioinformatics
depends on.
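The pairing rule used in the example above is easy to express in code. The sketch below, in plain Python with no external libraries, derives the complementary strand of the example sequence and also its reverse complement, which is the same strand rewritten in the conventional 5′ → 3′ direction. The function names are illustrative choices, and the code assumes an uppercase or lowercase sequence containing only A, C, G, and T.

```python
# Watson-Crick pairing rules from the text: A pairs with T, C pairs with G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the complementary strand base by base (reads 3' -> 5')."""
    return seq.upper().translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complementary strand written in the usual 5' -> 3' direction."""
    return complement(seq)[::-1]


plus_strand = "CTCACTGA"
print(complement(plus_strand))          # GAGTGACT
print(reverse_complement(plus_strand))  # TCAGTGAG
```

A translation table is used rather than a chain of `replace` calls because sequential replacements would overwrite earlier substitutions (for example, turning A into T and then that T back into A).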
Not all of the genomic DNA sequence is functional; only a small portion of it consists of functional units called genes. A gene is the basic physical and functional unit of heredity in the genome. Genes vary in length from a few hundred DNA bases to millions of bases. Some genes carry the instructions to make proteins, which do most of the work in cells and are required for the structure, function, and regulation of the body tissues and organs. However, many genes do not code for proteins but code for other biomolecules such as transfer RNA (tRNA), ribosomal RNA (rRNA), and microRNA, which take part in the translation machinery and in gene regulation. Scientists usually use sequence information to identify the genes and other sequence units of the genome and to determine their functions and their implications for traits and diseases. The Human Genome Project (HGP) estimated that humans have between 20,000 and 25,000 genes [12, 13]. Most genes are the same in all individuals of the same species, but a small number of genes differ slightly among individuals because of mutations.
In complex organisms, chromosomes come in pairs, with one chromosome of each pair inherited from the mother and the other from the father, so an individual has two copies (versions) of each gene. These two copies of the same gene are called alleles. The two alleles of a specific gene may be identical, having the same sequence (homozygous), or may differ (heterozygous). The difference is usually due to mutation (substitution, deletion, or insertion), which contributes to variation among individuals of the same species. A genotype is defined as all the genes passed on to an individual by their parents. However, not all the genes passed to the offspring are expressed as visible traits. The set of physical characteristics an individual has is called a phenotype.
An allele is either dominant, which means it shows its phenotypic effect, or recessive, which means it does not show its effect in the presence of a dominant allele. The dominant allele masks the effect of the recessive allele and is denoted by a capital letter (A), whereas the recessive allele is denoted by a small letter (a). In complex eukaryotes, since each parent provides one allele, the possible combinations are (AA), (Aa), and (aa). Offspring whose genotype is either (AA) or (Aa) will express the dominant trait as their phenotype, while individuals with (aa) will express the recessive trait. However, inheritance can also show codominance, a form of inheritance in which both alleles of a gene pair in a heterozygote are fully expressed. As a result, the phenotype of the offspring is a combination of the phenotypes of the parents. The AB blood type is an example of codominance in humans. A trait or phenotype may also be influenced by multiple genes. Such a trait is called a polygenic trait.
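The allele combinations described above can be enumerated directly. The sketch below crosses two parents for a single gene and tallies the offspring genotypes and phenotypes. It assumes complete dominance and that each parent passes on either of its two alleles with equal chance, an assumption added here for illustration; the function names `cross` and `phenotype` are hypothetical.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a single-gene cross, e.g. cross("Aa", "Aa")."""
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort so the dominant (capital) allele is written first: "aA" -> "Aa".
        offspring["".join(sorted(allele1 + allele2))] += 1
    return offspring


def phenotype(genotype: str) -> str:
    """Under complete dominance, one capital allele is enough to show the dominant trait."""
    return "dominant" if any(allele.isupper() for allele in genotype) else "recessive"


genotypes = cross("Aa", "Aa")
print(dict(genotypes))

phenotypes = Counter()
for genotype, count in genotypes.items():
    phenotypes[phenotype(genotype)] += count
print(dict(phenotypes))
```

Codominant traits such as the AB blood type would need a different `phenotype` function, since both alleles are expressed in the heterozygote.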
Genes are given unique gene names and gene symbols following the guidelines published by the HUGO Gene Nomenclature Committee (HGNC), which is the organization that sets the standards for human gene nomenclature [14, 15]. For example, a gene on human chromosome 17 that has been associated with the suppression of breast tumors is known as breast cancer type 1 susceptibility gene, and its gene symbol is BRCA1.
The HGP completed the sequencing of the human genome for the first time in April 2003. The sequencing was followed by genome annotation, which is the process of identifying the locations of genes and the protein-coding regions in a genome and determining what those genes do. The genome sequences of many organisms have been determined following the sequencing of the human genome. Their sequences provide interesting comparisons to that of the human genome and are proving useful in facilitating studies of different model organisms and in identifying a variety of functional sequences, including regulatory elements that control gene expression. The genome sequences of humans and chimpanzees are about 99% identical. The genome of Neandertals, our closest evolutionary relatives, has also been sequenced. It is estimated that Neandertals and modern humans diverged about 300,000–400,000 years ago [16]. The genomes of Neandertals and modern humans are more than 99.9% identical, so the two are significantly more closely related to each other than either is to chimpanzees. The number of genes in a eukaryotic genome is not simply related to either genome size or biological complexity. For example, the genome of the small plant Arabidopsis thaliana is only about 5% the size of the human genome, yet it contains approximately 26,000 protein-coding genes, compared with around 20,000 in the human genome. The discrepancies between eukaryotic genome size and the number of protein-coding genes arise because the genomes of most eukaryotic cells contain not only protein-coding sequences but also large amounts of DNA that does not code for proteins, called non-coding sequence.
TABLE 1.2 Genomes of Some Organisms (genome size in Mb = millions of base pairs)
Organism    Genome Size (Mb)    % Protein-Coding Sequence    Number of Protein-Coding Genes    Number of Proteins
Table 1.2 shows the genome sizes, sizes of protein coding regions, and the number of proteins for some
organisms. The data was obtained from the NCBI Genome database (last updated: January 27, 2020).
Generally, the genome of any organism can include two types of sequences: coding sequences or genes and
non-coding sequences.
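The comparison between Arabidopsis thaliana and the human genome made earlier in this section can be restated as gene density, that is, protein-coding genes per megabase. The sketch below uses only the rounded figures quoted in this chapter. The Arabidopsis genome size is derived here from the "about 5%" statement and is therefore a rough estimate rather than a measured value, and the helper `genes_per_mb` is a name introduced for this example.

```python
HUMAN_GENOME_MB = 3272.09     # human genome size quoted earlier in the chapter
HUMAN_GENES = 20_000          # "around 20,000" protein-coding genes
ARABIDOPSIS_FRACTION = 0.05   # "about 5% the size of the human genome"
ARABIDOPSIS_GENES = 26_000    # "approximately 26,000" protein-coding genes

# Derived estimate, not a measured genome size.
arabidopsis_genome_mb = ARABIDOPSIS_FRACTION * HUMAN_GENOME_MB


def genes_per_mb(genes: int, genome_mb: float) -> float:
    """Average number of protein-coding genes per megabase of genome."""
    return genes / genome_mb


human_density = genes_per_mb(HUMAN_GENES, HUMAN_GENOME_MB)
arabidopsis_density = genes_per_mb(ARABIDOPSIS_GENES, arabidopsis_genome_mb)

print(f"Human: {human_density:.1f} genes per Mb")
print(f"Arabidopsis thaliana: {arabidopsis_density:.1f} genes per Mb")
print(f"Ratio: {arabidopsis_density / human_density:.0f}x")
```

With these rounded inputs, the much higher gene density of the plant genome reflects the larger share of non-coding sequence in the human genome.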
Gene Structure
Genes are genome sequences that contain genetic information, or code, that can be transcribed into RNA (tRNA, rRNA, microRNA, and mRNA). RNAs are functional biomolecules that are expressed to carry out specific biological functions. Although the term gene is frequently used only for those DNA sequences that are transcribed into messenger RNA (mRNA) and translated into proteins, it can be used in a broader context to include any DNA sequence that contains code for performing a biological function in the cell. Thus, a gene can be defined as a genomic DNA sequence that represents a functional unit, and it can be either a protein-coding gene or a non-protein-coding gene. Protein-coding genes are transcribed into mRNA molecules that carry the coding sequences, or transcripts, for protein synthesis. Non-protein-coding genes are transcribed into the other types of RNA.
Since the gene is a functional unit, it differs from non-genic DNA sequences. The gene structure determines when and where the gene will be expressed, the level of expression, and the products. The DNA sequence of a functional gene is divided into distinct regions. The gene sequences of eukaryotic organisms have three types of regions: the regulatory region, introns (non-coding regions), and exons (coding regions) (Figure 1.13) [17, 18].
The gene structure of prokaryotic cells (bacteria and archaea) is simpler than that of eukaryotic cells because prokaryotic genes lack introns.
Enhancer In some eukaryotic genes, the enhancer region enhances the transcription of the gene by assisting
the core promoter to bind to a TF. The enhancer may be located upstream of a gene, within the coding region
of the gene, downstream of a gene, or may be thousands of nucleotides away. The enhancer sequence is a
binding site for transcription factors. The enhancing process is initiated when a DNA-bending protein binds
to the enhancer sequence of the gene and bends it. This bending allows the activators (bound to the
enhancers) to interact with the transcription factors (bound to the promoter region) and the RNA
polymerase (Figure 1.14) [19].
Core Promoter The core promoter (Figure 1.15) is the region of the gene where the general TFs and RNA
polymerase II (RNA Pol II) are assembled for the initiation of transcription of RNA. The core promoter sequence is
about 25–40 base pairs upstream of the transcription start site (TSS) of the coding region. The core promoter
may contain the following sequence elements, which act as recognition sites for the binding of the TFs.
(1) TATA box is a DNA sequence that indicates where a genic sequence can be read and decoded. The
sequence of the TATA box region is rich in thymine and adenine (T/A-rich sequence) with the consensus
sequence, which is the most frequent nucleotide sequence, as 5`-TATAWAAR-3` (see Table 1.3 for
the nucleotide symbols). The TATA box is usually located about 25-35 base pairs upstream of the
TSS (Figure 1.15). In bacteria, this region is called the Pribnow box, which has a shorter consensus
sequence.
TABLE 1.3 Nucleotide Symbols
Symbol  Description  Bases
R       Purine       A or G
Y       Pyrimidine   C or T
W       Weak         A or T
S       Strong       C or G
M       Amino        A or C
K       Keto         G or T
H       Not G        A, C, or T
B       Not A        C, G, or T
V       Not T        A, C, or G
D       Not C        A, G, or T
N       Any          A, C, G, or T
(2) Initiator (Inr) element is located from -2 to +4 overlapping the TSS. For example, the consensus
sequence of Inr is 5`-YYANWYY-3` in humans and 5`-TCAKT in Drosophila (see Table 1.3 for the
nucleotide symbols). The Inr works by enhancing binding affinity and strengthening the promoter.
(3) TFIIB recognition element (BRE) is located at -32 to -37 and has the consensus sequence
5`-SSRCGCC-3` (see Table 1.3). It can have positive or negative effects on transcription in a promoter
context-dependent manner.
(4) Motif ten element (MTE) is a conserved element with consensus sequence 5`-CSARCSSAACGS-3`
(see Table 1.3). It promotes gene transcription by RNA polymerase II.
(5) Downstream promoter element (DPE) is located at +28 to +32. The DPE functions cooperatively with
the Inr for the binding of TFIID in the transcription of core promoters in the absence of a TATA box.
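The consensus sequences above are written with the nucleotide symbols of Table 1.3, so they can be turned into search patterns mechanically. The following is a minimal sketch, assuming a plain regular-expression search is an acceptable stand-in for motif scanning; the function names and the promoter fragment are hypothetical and made up for illustration.

```python
import re

# IUPAC nucleotide symbols as listed in Table 1.3, plus the four plain bases
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(
        IUPAC[s] if len(IUPAC[s]) == 1 else "[" + IUPAC[s] + "]"
        for s in consensus.upper()
    )

def find_motif(sequence, consensus):
    """Return 0-based start positions of every (possibly overlapping) match."""
    # The lookahead lets overlapping occurrences be reported as well
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical promoter fragment, invented for this example
fragment = "GGCGCTATAAAAGGCTCAGT"
print(consensus_to_regex("TATAWAAR"))    # TATA[AT]AA[AG]
print(find_motif(fragment, "TATAWAAR"))  # [5]
```

An exact consensus match is a simplification; real promoter elements tolerate mismatches, which this sketch does not model.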
Proximal Control Elements The promoter proximal control element (PPE) is within around 100 nucleotides
upstream from the core promoter. It stimulates transcription of the gene by interacting with TFs. The number,
identity, and location of the proximal element vary from gene to gene. The PPE may contain a CAAT box,
which is a distinct pattern with GGCCAATCT consensus sequence that occurs upstream by 60–100 bases to
the TSS.
Introns
Introns are non-coding sections of eukaryotic genes. There are no introns in prokaryotic genes. Introns are
transcribed into the pre-mRNA, but they are removed before the mRNA is translated into a protein. Introns can range
in size from tens to thousands of base pairs and can be found in a wide variety of genes that generate RNA in
most living organisms, including viruses. It is vital that introns be removed precisely, as any left-over intron
nucleotides, or deletion of exon nucleotides, may result in a faulty protein being produced. This is because the
amino acids that make up proteins are joined together based on codons, which consist of three nucleotides.
An imprecise intron removal thus may result in a frame shift and the genetic code would be read incorrectly.
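The effect of imprecise intron removal on the reading frame can be illustrated with a short sketch. The exon and intron sequences below are hypothetical and chosen only to make the shift visible; `codons` is an illustrative helper, not part of any library.

```python
def codons(seq):
    """Split a sequence into consecutive triplets, dropping any incomplete tail."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

# Hypothetical exons and intron, invented for illustration
exon1, intron, exon2 = "AUGGCU", "GUAAGUUUCAG", "GAAUGGUAA"

precise = exon1 + exon2                 # intron removed exactly
imprecise = exon1 + intron[-1] + exon2  # one intron nucleotide left behind

print(codons(precise))    # ['AUG', 'GCU', 'GAA', 'UGG', 'UAA']
print(codons(imprecise))  # ['AUG', 'GCU', 'GGA', 'AUG', 'GUA']
```

A single left-over nucleotide changes every codon downstream of the splice junction, and in this example the stop codon UAA is no longer read in frame.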
Exons
Exons are coding sections of a gene. In eukaryotic genes, after the removal of introns in the transcribed RNA,
exons are spliced together to form gene transcripts, which include the codons that code for the amino acids
of the proteins. The genetic code is the set of rules by which information encoded in RNA is translated into
proteins (Table 1.4). The gene transcript is composed of tri-nucleotide units called codons, each coding for a
single amino acid. The protein-coding gene transcript contains an open reading frame (ORF). The ORF is a
continuous stretch of codons that begins with a start codon (usually ATG) and ends at a stop codon (usually
TAA, TAG, or TGA) and the codons that are translated into amino acids are between the start codon and stop
codon. In eukaryotic genes, following transcription, immature strands of messenger RNA, called pre-mRNA,
may contain both introns and exons. These pre-mRNA molecules go through a modification process in the
nucleus called splicing, during which the non-coding introns are cut out and only the coding exons remain.
Splicing produces a mature messenger RNA or the transcript that is then translated into a protein.
TABLE 1.4 Genetic Code for the Amino Acids Forming Proteins [21]
Repetitive Sequences
Repetitive sequences are highly repeated non-coding DNA sequences that represent a large portion of
eukaryotic genomes (hundreds of thousands of copies per genome). They are categorized into three major
classes: simple repeated sequences, retrotransposons, and DNA transposons.
Simple Repeated Sequences Simple repeated sequences, also known as short tandem repeats (STRs) or
microsatellites, are a class of repeated sequences that consist of thousands of copies of short tandemly repeated
DNA sequences with a repetitive unit of one to six base pairs, forming repetitive sequences of up to hundreds of
nucleotides. For example, “AGGCT”, which is a sequence of five nucleotides, may be repeated tandemly six times.
A tandem repeat with such a short unit is called a microsatellite, whereas a tandem repeat with a longer
repetitive unit is called a minisatellite. Such simple repeated sequences are widely found in prokaryotes and eukaryotes. They
may account for about 5% of the total genomic DNA in humans and a varied percentage in the genomes of
other organisms [23].
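A simple way to locate short tandem repeats in a sequence is a regular expression with a back-reference. This is only a sketch under stated assumptions: the unit length of one to six bases follows the text, while the minimum of three copies and the function name are arbitrary choices made for this example.

```python
import re

def find_tandem_repeats(seq, min_unit=1, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for each tandem repeat of a short unit."""
    # ([ACGT]{1,6}?) captures the shortest possible unit; \1{n,} requires it
    # to be repeated at least (min_copies - 1) more times directly afterwards.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]

# The five-nucleotide unit from the text, repeated six times between
# hypothetical flanking bases
example = "TTC" + "AGGCT" * 6 + "GA"
print(find_tandem_repeats(example))  # [(3, 'AGGCT', 6)]
```

Dedicated repeat finders also handle imperfect repeats; this sketch only reports perfect ones.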
Retrotransposons Retrotransposons are the transposable elements whose transposition in the genome is
mediated by reverse transcription of RNA to DNA. Some RNA molecules are converted to complementary
DNA (cDNA) by an enzyme called reverse transcriptase. The new DNA copy is integrated at a new site in the
genome. This class of repetitive sequences is a major contributor to the genome size (45% of human genomic
DNA). Retrotransposons are classified into short interspersed elements (SINEs), long interspersed elements
(LINEs), and retrovirus-like elements [24, 25]. SINEs and LINEs are examples of transposable elements that
can move to different sites in genomic DNA. SINEs are present in a very large number of copies (about
1,500,000 in the human genome). They range from 100 to 700 bp in length and make up around 13% of the
genome. LINEs are also highly repeated (about 850,000 copies in the human genome) and make up around 21%
of the human genome. The retrovirus-like
elements are DNA sequences that resemble retrovirus DNA sequences. A retrovirus is an RNA virus that uses
RNA as its genetic material. It is characterized by inserting a DNA copy of its genome into the genomes of infected
cells. Analysis of genomic DNA reveals the presence of many thousands of retrovirus-like elements, which
suggests that some retrovirus genomes were integrated into animal genomes and that the reverse
transcription of RNA to DNA played a major part in shaping the eukaryotic genome. The human retrovirus-like
elements range from approximately 2,000 to 10,000 bp in length. There are approximately 450,000 retrovirus-like
elements in the human genome (around 8% of human DNA) [26].
DNA Transposons DNA transposons are a class of interspersed repetitive elements that
move through the genome by being copied and reinserted as DNA sequences, rather than moving by reverse
transcription. In the human genome, there are about 300,000 copies of DNA transposons, ranging from 80 to
3000 bp in length, and accounting for approximately 3% of human DNA [27].
Pseudogenes
A gene is a genomic DNA sequence with a coding region and regulatory regions. Sequencing and annotation
of the human genome and the genomes of other organisms uncovered the presence of multiple copies
of some known genes that lack part or all of their regulatory region. Therefore, they are not functional
genes, and they are classified as non-coding sequences. These non-functional genes are called pseudogenes.
In fact, some genes have multiple functional copies to produce more proteins when needed. Some of these
copies may mutate and become pseudogenes. A pseudogene may also be formed when a functional gene is
transcribed into mRNA but, instead of being transported to a ribosome and translated into a protein
molecule, the mRNA is converted into a piece of double-stranded complementary DNA (cDNA) by enzymatic reverse
transcription (Figure 1.16). This cDNA is then integrated into the genome and becomes part of it as a
non-functional pseudogene [28]. Studies have shown that there are around 11,000 pseudogenes in the human
genome [29, 30].
Telomeres
A telomere (Figure 1.2b) is a repetitive nucleotide sequence of the genomic DNA found at each end of a
eukaryotic chromosome and of the linear chromosomes of several bacteria, including Streptomyces, Borrelia, and
Rhodococcus. Telomeres act as caps that protect the ends of chromosomes from deterioration. They play
critical roles in chromosome replication and maintenance during cell division. Studies showed that telomeres get
shorter each time a cell copies itself. The repeated sequences of telomeres contain clusters of guanine residues
(Gs). The sequence of telomere repeats in humans and other mammals is “TTAGGG”, which is repeated
hundreds or thousands of times and terminates with a 3` overhang of single-stranded DNA. Maintenance of
telomeres appears to be an important factor in determining the life span and reproductive capacity of cells.
Studies have shown that cancer cells have high levels of an enzyme called telomerase, which allows the cells to
maintain the ends of their chromosomes through indefinite divisions [31, 32].
RIBONUCLEIC ACID
Ribonucleic acid, or RNA, is a polymeric molecule produced from a gene by the process of gene transcription.
RNA is essential in almost all biological activities for its roles in coding for proteins (mRNA), transferring
amino acids (tRNA), translating codons into amino acids (rRNA), and regulating the expression of genes. An
RNA molecule is a single-stranded polymer that is slightly different from DNA. It contains the sugar ribose
instead of the deoxyribose that is found in the DNA molecule (Figure 1.17). The difference between ribose and
deoxyribose is that ribose possesses a hydroxyl group bound to the second carbon atom, while deoxyribose
possesses only a hydrogen atom without oxygen at that position [33].
Moreover, the RNA molecule is made up of the same nucleotides as DNA except that thymine is replaced
by a similar nucleotide called uracil (U) (Figure 1.18). Thus, the RNA sequence is made up of A, C, G, and U.
The order of nucleotides in an RNA molecule or fragment is represented by a sequence that is composed of
the letters A, C, G, and U for adenine, cytosine, guanine, and uracil nucleotide, respectively. An RNA sequence
may look like:
CUCACUAAAGACAGAAUGAAUGUAGAAAAGGCUGAAUUCUGUAAU
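As a small illustration of the RNA alphabet, the sketch below checks that a string uses only A, C, G, and U and converts between the RNA and DNA notations. The function names are hypothetical and introduced only for this example.

```python
from collections import Counter

def is_rna(seq):
    """True if the sequence contains only the RNA letters A, C, G, and U."""
    return set(seq.upper()) <= set("ACGU")

def rna_to_dna(seq):
    """Rewrite an RNA sequence in the DNA alphabet (U becomes T)."""
    return seq.upper().replace("U", "T")

rna = "CUCACUAAAGACAGAAUGAAUGUAGAAAAGGCUGAAUUCUGUAAU"
print(is_rna(rna))              # True
print(is_rna(rna_to_dna(rna)))  # False, because the DNA form contains T
print(Counter(rna))             # nucleotide composition of the example
```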
RNA is always the product of a gene, made by gene transcription that is regulated by regulatory factors
and complex pathways to accommodate the biological activities in the cell. There are four types of RNA
molecule in living cells (Figure 1.19): messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA),
and microRNA (miRNA). Only mRNA is a protein-coding RNA (4%) while the others are non-protein-coding
RNAs (96%).
Ribosomal RNA
Ribosomal RNA (rRNA) molecules are one of the major components of the protein-synthesizing
organelle called the ribosome in living cells. Growing mammalian cells may contain about ten million ribosomes. The
rRNA molecules are transcribed from genes in the nucleus in eukaryotes, or in the nucleoid (chromatin body) in
prokaryotes, and function in the cytoplasm, where ribosomes translate the genetic information carried by mRNA
into proteins. There are several rRNA types, distinguished by molecule size as measured by the Svedberg
coefficient (S). A Svedberg unit (S) is a non-SI unit
for sedimentation rate in a test tube under the centrifugal force of an ultra-high-speed centrifuge. One Svedberg
unit (S) is 10⁻¹³ seconds. The S value of an RNA molecule is determined by its mass, density, and shape.
An rRNA molecule or ribosomal subunit may be given the name “nS”, where “n” is the number of Svedberg
units. Both prokaryotic and eukaryotic ribosomes are formed of a larger and a smaller subunit (Figure 1.20).
Prokaryotic ribosomes are composed of 50S and 30S subunits while eukaryotic ribosomes are composed of 60S and 40S
subunits. The two subunits come together during mRNA translation into a protein. Each ribosome subunit is
made of both rRNA and proteins [34].
In bacteria, the larger ribosomal subunit (50S) contains 5S rRNA (≈120 nucleotides), 23S rRNA (≈3000
nucleotides), and 31 proteins. The smaller subunit (30S) contains 16S rRNA (≈1500 nucleotides) and 21
proteins. Each of these rRNA molecules (5S rRNA, 23S rRNA, and 16S rRNA) has its own gene called the 5S
rRNA, 23S rRNA, and 16S rRNA gene, respectively. Ribosomal RNA genes are called ribosomal DNA (rDNA).
The 16S rRNA gene is used in phylogeny and bacterial identification [35].
In eukaryotes, the 60S or large ribosomal subunit has three rRNA molecules, 28S rRNA (4,800 nucleotides),
5.8S rRNA (160 nucleotides), and 5S rRNA (120 nucleotides), and 50 proteins. The 40S small ribosomal sub-
unit has only 18S rRNA (1,900 nucleotides) and 33 proteins. Each eukaryotic rRNA has its own coding gene.
The rRNA gene or rDNA has been studied in detail for its biological importance and repeated nature [36, 37].
Transfer RNA
Transfer RNA (tRNA) plays an important role in the translation step of protein synthesis in ribosomes. It translates
the message coded in the nucleotide sequences of mRNA into specific amino acids, which are joined together
to form a protein. Although tRNA is a single-stranded molecule, it can form double-stranded structures that
are important to its function in protein synthesis. A single RNA molecule folds over and forms secondary
structures such as hairpin loops, in which double-stranded regions are
stabilized by hydrogen bonds between complementary nucleotides [38]. Such base-pairing of
RNA is critical for tRNA. The tRNA molecule has two important regions: the amino acid binding site, where
a specific amino acid binds to the tRNA molecule, and a trinucleotide region in a hairpin-like loop called the
anticodon, which is complementary to a codon in mRNA during translation (Figure 1.21). A codon consists
of three nucleotides on an mRNA strand that encode a specific amino acid. During translation (passing of
mRNA between the two ribosome subunits), a codon on the mRNA pairs with an anticodon on the tRNA, and
the corresponding amino acid is released from the tRNA and added to the growing polypeptide chain.
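Codon and anticodon pairing follows the usual base-pairing rules (A with U, C with G) on antiparallel strands, so the anticodon is the reverse complement of the codon. The sketch below assumes strict Watson-Crick pairing and ignores wobble pairing at the third codon position; `anticodon_for` is a hypothetical helper name.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def anticodon_for(codon):
    """Return the anticodon, written 5' to 3', that pairs with an mRNA codon."""
    # Complement each base, then reverse because the strands are antiparallel
    return codon.upper().translate(RNA_COMPLEMENT)[::-1]

print(anticodon_for("AUG"))  # CAU
print(anticodon_for("UCU"))  # AGA
```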
MicroRNA
MicroRNAs, or miRNAs, are small non-coding RNA molecules of about 22 nucleotides in length. They are
encoded in the genomes; some of them are transcribed from within protein-coding units (exons) of a gene
and others are transcribed from within non-coding regions of a gene (introns). The miRNAs generated from
exons are called canonical miRNAs, while the ones generated from introns are called non-canonical miRNAs.
MiRNAs are produced by two ribonucleases (Drosha and Dicer). The precursors of the miRNAs may be long
RNAs that fold to form hairpin secondary structures, which are then cleaved sequentially by the ribonucleases
(Figure 1.22).
MiRNAs play an important regulatory role in gene expression by halting translation of mRNA into protein
through a process called RNA interference (RNAi) or gene silencing. A miRNA strand is incorporated into the
RNA-induced silencing complex (RISC), which is a protein complex that uses the incorporated miRNA strand
to base-pair with the target mRNA and to stimulate its degradation by a protein in RISC.
The miRNA targeting sites are usually in the 3`-untranslated region (UTR) of the targeted mRNA. A single
miRNA can target up to 100 different mRNAs. Therefore, miRNAs may regulate more than half of the protein-
coding genes in eukaryotes. MiRNAs have also been linked to some types of cancers and to translocation
mutations in chromosomes [39, 40].
Messenger RNA
Messenger RNA (mRNA) is transcribed from a gene and synthesized in the nucleus using the nucleotide
sequence of the gene as a template. The transcribed mRNA is complementary to the nucleotide sequence of one
DNA strand of the gene. The process of mRNA transcription requires nucleotide triphosphates as substrates
and the enzyme RNA polymerase II as a catalyst. The transcribed mRNA of prokaryotes is different from that
of eukaryotes. The prokaryotic primary mRNA contains no introns, while the eukaryotic primary mRNA
contains introns that will be spliced out to keep only the exons to form the final mRNA. The mRNA is formed
in the nucleus and transported to the cytoplasm where it passes through the ribosome (between smaller and
larger ribosome subunits) and interacts with the tRNA so the transcript will be translated into amino acids and
the protein is assembled. In the following we will discuss both transcription and translation [41].
distinguishes cells of an organ from others even though all somatic cells have the same genome and the same
set of genes. For example, the genes expressed in hepatocytes or liver cells are the genes that play pivotal roles
in metabolism, detoxification, protein synthesis, and innate immunity against invading microorganisms. The
set of genes expressed in liver cells will be different from the set of genes expressed in the cells of lungs, kidney,
or brain. However, only housekeeping genes are expressed in all cells because they perform basic biochemical
activities that occur in all cells. Housekeeping genes are typically required for the maintenance of basic cellular
functions that are essential for the existence of a cell, regardless of its specific role in the tissue or organism. Thus,
they are expressed in all cells of an organism under normal and pathophysiological conditions, irrespective
of tissue type, developmental stage, cell cycle state, or external signal. Therefore, they are widely used as
internal controls for experimental studies such as in polymerase chain reaction (PCR) experiments. The list of
the housekeeping genes may vary from one organism to another. Some of the housekeeping genes of human
and other animals may include beta-actin (ACTB), glyceraldehyde-3-phosphate dehydrogenase (GAPDH),
phosphoglycerate kinase 1 (PGK1), peptidylprolyl isomerase A (PPIA), ribosomal protein P0 (RPLP0), acidic
ribosomal phosphoprotein P0 (ARBP), and others [42, 43].
Generally, the amounts and types of mRNA molecules in a cell reflect the biological activities of that cell.
Thousands of different mRNAs are produced every second in every cell. The expression of any gene is regulated
by a complex pathway to provide an adequate amount of protein. Since gene expression in any kind of
cell is affected by several factors, the expression level of all genes in a cell at a specific time and under
specific conditions, called gene expression profiling, serves as a snapshot or checkpoint in some genomic and cancer
research, and it is also used in the diagnosis of some diseases.
RNA molecules are initially synthesized as precursors or pre-RNA, which must undergo a process
called splicing to remove the introns and join the exons together to form the
mature RNA. Splicing of pre-mRNA is therefore a major part of the process that results in synthesis of the
protein-coding component of the transcriptome, that is, the whole mRNA of the cell. Some eukaryotic genes
can code for different protein variants or isoforms through so-called alternative splicing, which is the
joining of exons in varied combinations to alter the sequence of the coded proteins. The
protein isoforms may have different functions or properties.
The steps of the mRNA transcription from a gene in eukaryotic cells include initialization of mRNA
transcription, transcript elongation, transcript termination, and post-transcriptional modification [44].
Initialization Stage Transcription of the mRNA is regulated by transcription factors, which are proteins that
bind to the core promoter region of the gene. Most TFs have a ligand-binding domain (LBD) and a DNA-binding
domain (DBD). The LBD is the site that, when it binds to an activating substance, activates the DBD of the TF to
bind to the DNA promoter region of the target gene. TFs are activated and deactivated by a complex pathway.
Once a TF binds to the promoter region of a target gene, the RNA polymerase II will bind to the polymerase
binding sequence of the core promoter region and then it separates the double-stranded gene providing two
separate strands of DNA, one of which will be the template strand or anti-sense strand for the transcription of
mRNA and the other is the non-template strand or coding strand or sense strand (Figure 1.23).
Elongation Stage Once the two DNA strands of the gene are separated, only one strand (the template strand)
will be the template for the RNA polymerase II. The enzyme will start to read one base at a time to build an
RNA strand out of complementary nucleotides. The complementary RNA strand grows from 5` end to 3` end
and will continue until all mRNA is transcribed (Figure 1.24). The RNA transcript will be identical to the non-
template strand except that thymine (T) will be replaced by uracil (U).
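The relationship between the two DNA strands and the transcript can be checked with a short sketch. It assumes both strands are written 5` to 3`, so the template strand is the reverse complement of the coding strand; the coding sequence used here is hypothetical.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]

def transcribe(template):
    """Return the mRNA (5' to 3') built from a template strand given 5' to 3'."""
    # The polymerase pairs each template base with its complement and uses U
    # where DNA would have T.
    return reverse_complement(template).replace("T", "U")

coding = "ATGGCTGAATGGTAA"             # hypothetical coding (sense) strand
template = reverse_complement(coding)  # template (anti-sense) strand
mrna = transcribe(template)
print(mrna)                              # AUGGCUGAAUGGUAA
print(mrna == coding.replace("T", "U"))  # True
```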
Termination Stage The mRNA strand will continue to grow in the above elongation stage until it reaches a
transcribed sequence that bends to form a hairpin structure called a terminator, which terminates the
transcription process (Figure 1.25). The pre-mRNA is then released from the RNA polymerase.
Posttranscriptional Modifications After transcription, the pre-mRNA undergoes some essential modifications
to be ready for the translation stage in ribosomes. The post-transcriptional modifications of the pre-mRNA
include splicing and 5`-end and 3`-end modifications (addition of a cap to the 5`-end, and addition of a
poly(A) tail to the 3`-end).
Gene Splicing Alternative gene splicing is a post-transcriptional modification through which a single gene can code for
multiple protein variants or isoforms. There are several types of gene-splicing processes, which can take place
simultaneously after the pre-mRNA is formed. Exon skipping is the most common gene-splicing process, in which an exon
or exons are included in or excluded from the final gene transcript, leading to extended or shortened mRNA variants.
Intron retention (IR) is the splicing process in which an intron (non-coding region) is retained in the final transcript.
IR plays a role in gene expression regulation and is associated with complex diseases. Alternative 3` splice site and
alternative 5` splice site selection are splicing mechanisms in which different 3` or 5` splice sites are joined, which
changes the boundaries of an exon. Figure 1.26 shows the types of gene splicing [45].
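Exon skipping can be pictured as choosing which exons to keep when the transcript is assembled. The following sketch enumerates the transcripts obtained by skipping every subset of a set of optional exons; the exon sequences and the function name are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def exon_skipping_isoforms(exons, optional):
    """Map each tuple of kept exon indices to the resulting transcript."""
    isoforms = {}
    for k in range(len(optional) + 1):
        for skipped in combinations(optional, k):
            kept = [i for i in range(len(exons)) if i not in skipped]
            isoforms[tuple(kept)] = "".join(exons[i] for i in kept)
    return isoforms

# Three hypothetical exons; only the middle one (index 1) may be skipped
exons = ["AUGGCU", "GAAUGG", "CCAUAA"]
for kept, transcript in exon_skipping_isoforms(exons, optional=[1]).items():
    print(kept, transcript)
# (0, 1, 2) AUGGCUGAAUGGCCAUAA
# (0, 2) AUGGCUCCAUAA
```

With n optional exons this produces 2^n candidate transcripts, which is one reason a single gene can yield many isoforms.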
The 5`-End and 3`-End Modifications Both ends of the pre-mRNA are modified. A cap of 7-methylguanosine is
attached at the 5`-end of the pre-mRNA. This cap is needed to help initiate translation of the mRNA into a protein in
the ribosome.
A series of about 250 adenine nucleotides called poly(A) tail is attached to the 3`-end of the pre-mRNA. The
poly(A) synthesis is catalyzed by poly(A) polymerase.
The structure of the mature functional mRNA consists of a 5` cap, a 5` untranslated region (5`UTR), a coding
region called the open reading frame (ORF), a 3` untranslated region (3`UTR), and a poly-A tail (Figure 1.27).
Five Prime Untranslated Region The 5`-untranslated region (5`-UTR) is the untranslated part of an mRNA
strand that extends from its 5`-terminus (cap site) to the translational start codon (AUG). The 5`-UTR
sequence length is about three to ten bases in prokaryotic mRNA and varies from hundreds to thousands
of bases in eukaryotic mRNA (about 150 bases in humans).
Three Prime Untranslated Region The 3`-untranslated region (3`-UTR) is the section of mRNA that
immediately follows the translation termination codon (stop codon).
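The layout of a mature mRNA described above can be sketched as a simple split into 5`-UTR, ORF, and 3`-UTR. This is a simplification under stated assumptions: translation is taken to start at the first AUG, the cap is not represented, and the poly(A) tail is left as part of the 3` segment. The example sequence and the function name are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_mrna(mrna):
    """Split an mRNA into (5'UTR, ORF, 3'UTR) using the first AUG, or None."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Read codon by codon from the start codon until an in-frame stop codon
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    return None

# Hypothetical transcript: 5'UTR + ORF + 3'UTR + short poly(A) tail
mrna = "GGACC" + "AUGGCUGAAUGGUAA" + "CUUGC" + "A" * 10
print(split_mrna(mrna))
# ('GGACC', 'AUGGCUGAAUGGUAA', 'CUUGCAAAAAAAAAA')
```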
The Open Reading Frame The protein-coding region is known as the open reading frame (ORF). It consists of a
series of codons that specify the amino acid residues of the protein sequence that the gene codes for. A codon
consists of three consecutive nucleotides that code for a specific amino acid. Because each of the three positions
can hold any of four nucleotides, there are 4 × 4 × 4 = 64 possible codons, but only 20 amino acids are used for
protein synthesis. Therefore, some amino acids are coded by more than one codon.
Table 1.5 shows the codons and the amino acids. The first bases of the codon are in the first column, the
second bases are in the first row, and the third bases are in the last column of the table. For example, UCU,
UCC, UCA, and UCG are the codons for serine (Ser). The property, in which a single amino acid is coded by
multiple codons, is called degeneracy of codons or the redundancy of the genetic code. The genetic code is
degenerate mainly at the third codon position to reduce the effect of mutations.
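The degeneracy of the genetic code can be counted directly from the standard codon table. The sketch below builds the table from a compact 64-letter string in the conventional U, C, A, G ordering (with `*` marking stop codons) and tallies how many codons encode each amino acid; the variable names are choices made for this example.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

degeneracy = Counter(CODON_TABLE.values())
serine = sorted(c for c, aa in CODON_TABLE.items() if aa == "S")

print(len(CODON_TABLE))  # 64
print(degeneracy["S"])   # 6
print(serine)            # ['AGC', 'AGU', 'UCA', 'UCC', 'UCG', 'UCU']
print(degeneracy["M"], degeneracy["W"])  # 1 1
```

The four serine codons named in the text (UCU, UCC, UCA, UCG) differ only at the third position; serine is also encoded by AGU and AGC.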
The ORF usually (but not always) begins with the start codon AUG, and ends with a stop codon, which is any
one of UAA, UAG, or UGA. ORF scanning or ab initio gene prediction programs predict genes by searching
a DNA sequence for ORFs that begin with an ATG and end with a termination triplet. The gene scanning is
complicated by the fact that each DNA sequence has six possible reading frames; three possible frames are
in one direction and three frames are in the reverse direction on the complementary strand (Figure 1.28).
Computers are quite capable of scanning all six reading frames for ORFs.
Given the coding region of a gene, we can use the above information to predict the mRNA sequence and
the ORFs.
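ORF scanning over the six reading frames can be sketched as follows. This is a minimal illustration, not a gene prediction program: it reports every stretch that starts at the first ATG seen in a frame and ends at the next in-frame stop codon, with no minimum length. The input sequence and function names are hypothetical.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(DNA_COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame):
    """Yield ORFs (ATG through stop codon, inclusive) in one reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield seq[start:i + 3]
            start = None

def six_frame_orfs(dna):
    """Return (strand, frame, orf) tuples for all six reading frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for orf in orfs_in_frame(seq, frame):
                results.append((strand, frame, orf))
    return results

print(six_frame_orfs("CCATGGCTGAATGGTAAGG"))
# [('+', 2, 'ATGGCTGAATGGTAA')]
```

Frames on the `-` strand are numbered from the start of the reverse complement, which is one of several conventions in use.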
involves interactions among three types of RNA molecules (mRNA, tRNA, and rRNA), as well as various
proteins that are required for translation.
During the translation of an mRNA into a protein, the amino acids are aligned with their corresponding
codons on the mRNA template. A tRNA is approximately 75 nucleotides long; it forms complementary
base pairs between different regions of the molecule and folds into an L-shape to fit onto ribosomes during
the translation process (Figure 1.29). Each tRNA strand has the sequence CCA at its 3` terminus, where
an amino acid is covalently attached to the ribose of the terminal adenosine (A). The codon on the mRNA
template is recognized by the anticodon loop located at the other end of the folded tRNA, forming
complementary base pairing. The attachment of each amino acid to its specific tRNA is mediated by a specific enzyme
called aminoacyl tRNA synthetase. There is an enzyme for each amino acid. When a codon complements an
anticodon, the amino acid attached to the tRNA will be released to bind to the growing protein sequence.
Translation will continue until the entire ORF is translated and the stop codon is reached. The process of
translation is carried out by millions of ribosomes in the cell simultaneously [46].
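The codon-by-codon reading of an ORF can be sketched in a few lines. The example assumes the standard genetic code and that translation starts at the first AUG in the transcript; the transcript and the `translate` helper are hypothetical and written for this illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA notation as well
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGACCAUGGCUGAAUGGUAACUUGC"))  # MAEW
```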
THE PROTEINS
Proteins are the biomolecules that are translated from gene transcripts or mRNAs as discussed above. Any
protein must be coded by a gene. However, a gene may code for multiple protein isoforms that differ in sequence
and function. A protein molecule is built from 20 different amino acids, which are the building blocks of all proteins.
Amino acids are small organic molecules that consist of an alpha (central) carbon atom linked to an amino
group (NH2), a carboxyl group (COOH), a hydrogen atom (H), and a variable component called a side chain
given the symbol (R) (Figure 1.30) [47].
In a protein molecule, multiple amino acids are linked together by peptide bonds, thereby forming a
long chain or polypeptide. Peptide bonds are formed by a biochemical reaction that extracts a water
molecule as it joins the amino group of one amino acid to the carboxyl group of a neighboring amino acid.
The linear sequence of amino acids within a protein is considered the primary structure of the protein
(Figure 1.31).
Each amino acid has a unique side chain (R), whose chemistry differs from that of the other amino
acids. Many amino acids have non-polar side chains, which have pure hydrocarbon alkyl groups (alkane
branches) or aromatic groups (benzene rings). Amino acids with non-polar side chains include glycine (G), alanine
(A), valine (V), leucine (L), isoleucine (I), methionine (M), phenylalanine (F), tryptophan (W), and proline
(P) (Figure 1.32).
Other amino acids have polar but uncharged side chains. These polar amino acids include serine (S),
threonine (T), cysteine (C), tyrosine (Y), asparagine (N), and glutamine (Q) (Figure 1.33).
Other amino acids have side chains with positive (basic amino acids) or negative charges (acidic amino
acids). Those amino acids that have electrically charged side chains include aspartate (D), glutamate (E), lysine
(K), arginine (R), and histidine (H) (Figure 1.34).
The physicochemical properties of the side chains of amino acids are critical to protein structures
because these side chains can bind with one another to fold the linear protein residues in a certain shape or
conformation forming the so-called three-dimensional structure (3D) of protein. Any protein has a specific
three-dimensional structure that determines its biological functions in the living cells. Charged amino
acid side chains can form ionic bonds, and polar amino acids can form hydrogen bonds. Hydrophobic
side chains interact with each other via weak van der Waals interactions. Most bonds formed by these
side chains are non-covalent. Only cysteines can form covalent bonds through their sulfur-containing
side chains. If the sulfur atom of one cysteine is bonded to the sulfur atom of another cysteine, a covalent
disulfide bridge is formed (Figure 1.35).
The order of amino acids in a protein or a polypeptide is represented by a sequence formed from the
single-letter abbreviations of the amino acids forming the molecule. Table 1.5 contains the names of the 20 amino
acids and their three-letter and single-letter abbreviations.
In bioinformatics, the order of amino acids (linked to one another by peptide bonds) in a protein molecule
is represented by a linear sequence made up of the single-letter symbols of the 20 amino acids as follows:
LTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQK
LPCSENPRDTEDVPWITLNSSIQK
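Using the three side-chain groups described above (non-polar, polar uncharged, and charged), the composition of a protein sequence can be summarized with a short sketch. The grouping follows the lists given in the text; the names `CLASSES` and `class_composition` are hypothetical.

```python
from collections import Counter

# Side-chain groups as listed in the text, by single-letter symbol
CLASSES = {
    "non-polar": "GAVLIMFWP",
    "polar": "STCYNQ",
    "charged": "DEKRH",
}
CLASS_OF = {aa: name for name, letters in CLASSES.items() for aa in letters}

def class_composition(protein):
    """Count the residues of a protein sequence in each side-chain group."""
    return Counter(CLASS_OF[aa] for aa in protein.upper())

protein = (
    "LTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQK"
    "LPCSENPRDTEDVPWITLNSSIQK"
)
counts = class_composition(protein)
for name in CLASSES:
    print(name, counts[name], round(100 * counts[name] / len(protein), 1))
```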
No.  Amino acid      Three-letter  Single-letter
1    Alanine         Ala           A
2    Arginine        Arg           R
3    Asparagine      Asn           N
4    Aspartate       Asp           D
5    Cysteine        Cys           C
6    Glutamine       Gln           Q
7    Glutamate       Glu           E
8    Glycine         Gly           G
9    Histidine       His           H
10   Isoleucine      Ile           I
11   Leucine         Leu           L
12   Lysine          Lys           K
13   Methionine      Met           M
14   Phenylalanine   Phe           F
15   Proline         Pro           P
16   Serine          Ser           S
17   Threonine       Thr           T
18   Tryptophan      Trp           W
19   Tyrosine        Tyr           Y
20   Valine          Val           V
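The mapping in Table 1.5 can be expressed as a small lookup. The following is a minimal Python sketch, not a standard library routine: the names ONE_TO_THREE, CHARGED, to_three_letter, and count_charged are illustrative assumptions. It converts a single-letter protein sequence to three-letter codes and counts the residues with electrically charged side chains (D, E, K, R, H).

```python
# Single-letter to three-letter amino acid codes, as listed in Table 1.5.
# All names in this block are illustrative, not part of any library.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# Amino acids with electrically charged side chains.
CHARGED = set("DEKRH")


def to_three_letter(sequence):
    """Convert a single-letter protein sequence to hyphen-joined three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())


def count_charged(sequence):
    """Count residues whose side chains are electrically charged."""
    return sum(1 for residue in sequence.upper() if residue in CHARGED)


# First five residues of the example sequence shown in the text.
print(to_three_letter("LTKDR"))  # Leu-Thr-Lys-Asp-Arg
print(count_charged("LTKDR"))    # 3
```

An unknown symbol raises a KeyError in this sketch, which is usually preferable to silently skipping a residue.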
Primary Structure The primary structure of a protein molecule is the linear sequence of amino acids in a
polypeptide chain. At this level, the only links between the amino acid residues are the peptide bonds that create
the chain (Figure 1.36). The sequence of a protein represents its primary structure.
Secondary Structure The secondary structure of a protein molecule arises when amino acids in the chain
form hydrogen bonds with one another. Regions of the polypeptide fold because of interactions between
atoms of the backbone. This local folding of the polypeptide chain forms two main secondary
shapes: the alpha helix and the beta sheet.
FIGURE 1.37 Secondary structures of protein (a) alpha helix and (b) beta sheet.
The alpha helix (α-helix) is a spiral configuration of the polypeptide chain, which can be either right-handed
or left-handed; the right-handed form is by far the more common in proteins. In an alpha helix, the carbonyl
(C=O) of one amino acid forms a hydrogen bond with the amino group (N-H) of the amino acid four residues
further along the chain (e.g., the carbonyl of amino acid 1 forms a hydrogen bond to the N-H of amino acid 5).
This pattern of hydrogen bonds shapes the polypeptide chain into a helical ribbon with each turn of the helix
containing 3.6 amino acids. The side chains or R groups of the amino acids project outward from the helix
(Figure 1.37a).
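The hydrogen-bonding pattern of the alpha helix can be made concrete with a short calculation. The Python sketch below is illustrative only and assumes an idealized helix in which every residue follows the residue 1 to residue 5 rule and each turn holds 3.6 residues; the function names are hypothetical.

```python
RESIDUES_PER_TURN = 3.6  # residues per helical turn, as stated in the text
HBOND_OFFSET = 4         # C=O of residue i bonds to N-H of residue i + 4


def helix_hbond_pairs(n_residues):
    """List (carbonyl residue, N-H residue) pairs for an idealized helix.

    Residues are numbered from 1, matching the example in the text.
    """
    return [(i, i + HBOND_OFFSET)
            for i in range(1, n_residues - HBOND_OFFSET + 1)]


def helix_turns(n_residues):
    """Approximate number of turns in an idealized helix of n_residues."""
    return n_residues / RESIDUES_PER_TURN


print(helix_hbond_pairs(8))       # [(1, 5), (2, 6), (3, 7), (4, 8)]
print(round(helix_turns(18), 2))  # 5.0
```

Real helices deviate from this idealization at their ends, so the pair list should be read as the expected pattern rather than a prediction for any specific protein.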
The beta sheet (β-sheet) is made of two or more adjacent segments of the same polypeptide chain, called beta
strands, which form a sheet-like structure stabilized by hydrogen bonds (Figure 1.37b). The hydrogen bonds are
formed between the carbonyl groups and amino groups of the backbone of the polypeptide, while the side chains
extend above and below the plane of the sheet. The strands forming the beta sheet can be parallel, pointing in the
same direction (the N- and C-termini match), or anti-parallel, pointing in opposite directions (the N-terminus of
one strand is positioned next to the C-terminus of the other).
Tertiary Structure The tertiary structure of a protein molecule is the level of structure at which the entire
polypeptide folds into a three-dimensional shape. It is primarily due to interactions between the side chains
(R groups) of the amino acids that make up the polypeptide. The side chain interactions that contribute to the
tertiary structure include hydrogen bonding, ionic bonding, dipole-dipole, and hydrophobic interactions.
Moreover, covalent disulfide bonds between the sulfur atoms in the side chains of cysteines in the polypeptide
also contribute to the tertiary structure of proteins (Figure 1.38).
Quaternary Structure Some proteins are formed of multiple polypeptide chains or subunits. The quaternary
structure of a protein is formed by the association of several polypeptide chains or subunits into a closely
packed arrangement. The subunits are held together by hydrogen bonds and by van der Waals forces
between non-polar side chains of the amino acids. Each of the subunits has its own primary, secondary,
and tertiary structure (Figure 1.39).
GENE MUTATIONS
A gene mutation is a change in the sequence of a gene; it contributes to diversity across
organisms and may have widely differing consequences. For mutations to be heritable, they must occur in
germline cells, the cells that produce the next generation. Such mutations are called germline mutations.
Somatic mutations, which occur in somatic cells, are not heritable and affect only the body of the present
organism. Somatic mutations therefore have no direct significance for evolution, and evolutionary theory is mostly
interested in germline mutations. Mutations may occur randomly, and their consequences may be harmful,
beneficial, or neutral. Beneficial gene changes may be conserved over the long term and passed to
offspring. Mutation rates are usually very low, and biological systems go to extraordinary lengths
to keep them as low as possible, mostly because many mutational effects are harmful. DNA repair
and proofreading during DNA replication reduce the mutation rate. Mutations can also result from the
exposure of an organism to certain environmental factors, such as radiation or chemical carcinogens.
DNA mutations can occur in several ways. They have varying effects on the living organism, depending on
where they occur and whether they alter the function of essential proteins. When the change involves only one
nucleotide position in the DNA sequence, it is called a point mutation. A DNA mutation can be caused by
substitution, insertion, or deletion. In a substitution mutation, a nucleotide is replaced by one of the other three
nucleotides. A substitution at a single position, when it occurs as a variant within a population, is known as a
single nucleotide polymorphism (SNP). A substitution mutation can be silent, missense, or nonsense. In a silent
substitution mutation, the replacement does not change the coded amino acid (synonymous codon). For example,
both codons GCA and GCC code for the amino acid alanine (Ala); therefore, replacing adenine (A) in GCA with
cytosine (C) does not change the translated amino acid (Ala) and has no effect on the protein. In a missense
substitution mutation, the nucleotide replacement results in a codon that codes for a different amino acid
(nonsynonymous codon). For example, the codon GGA codes for glycine (Gly), so substituting the first guanine (G)
with adenine (A) results in the codon AGA, which codes for the amino acid arginine (Arg). The mutation therefore
changes the translated protein. In a nonsense substitution mutation, the nucleotide substitution results in a stop
codon that terminates protein translation before completion, and the consequence is a shorter protein
that does not function properly. For example, the codon CAA codes for glutamine (Gln). If the cytosine (C) in
CAA (in a protein coding region) is replaced by thymine (T), the resulting codon is TAA (UAA in mRNA), which
is a stop codon. During translation at the ribosome, protein synthesis is terminated when UAA is
reached. Figure 1.44 shows the three types of substitution.
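The three kinds of substitution can be distinguished by comparing the amino acids encoded before and after the change. The following Python sketch is a minimal illustration under stated assumptions: it uses the standard genetic code, the function name classify_substitution is hypothetical, and it deliberately rejects an original stop codon because that case is not covered by the three categories described above.

```python
from itertools import product

# Standard genetic code written in the conventional T, C, A, G order;
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(original, mutated):
    """Classify a codon change as 'silent', 'missense', or 'nonsense'.

    Codons may be given as DNA (T) or RNA (U). Illustrative helper only.
    """
    original = original.upper().replace("U", "T")
    mutated = mutated.upper().replace("U", "T")
    if original == mutated:
        raise ValueError("the two codons are identical")
    before = CODON_TABLE[original]
    after = CODON_TABLE[mutated]
    if before == "*":
        raise ValueError("original codon is a stop codon; not covered here")
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"


# The three examples from the text.
print(classify_substitution("GCA", "GCC"))  # silent   (Ala -> Ala)
print(classify_substitution("GGA", "AGA"))  # missense (Gly -> Arg)
print(classify_substitution("CAA", "TAA"))  # nonsense (Gln -> stop)
```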
In an insertion mutation, one or several nucleotides are inserted into the DNA sequence of a gene, while
in a deletion mutation, one or several nucleotides are removed from the DNA sequence. Deletion and
insertion mutations may cause a frameshift, which occurs when the number of nucleotides inserted or deleted is
not a multiple of three, so that the mutation changes the reading frame of the open reading frame (ORF)
of a gene. The resulting protein is usually nonfunctional.
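The effect of an insertion on the reading frame can be seen by splitting a sequence into codons before and after the change. The Python sketch below is illustrative only; the sequence and the helper names (codons, insert_bases, causes_frameshift) are assumptions chosen for the example.

```python
def codons(sequence):
    """Split a coding sequence into complete codons, dropping a trailing partial codon."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def insert_bases(sequence, position, bases):
    """Insert bases into sequence before the given 0-based position."""
    return sequence[:position] + bases + sequence[position:]


def causes_frameshift(indel_length):
    """An insertion or deletion shifts the frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


original = "ATGGCAGGACAA"
print(codons(original))
# ['ATG', 'GCA', 'GGA', 'CAA']

# Inserting one nucleotide changes every downstream codon.
print(codons(insert_bases(original, 3, "T")))
# ['ATG', 'TGC', 'AGG', 'ACA']

# Inserting three nucleotides adds a codon but keeps the frame.
print(codons(insert_bases(original, 3, "TTT")))
# ['ATG', 'TTT', 'GCA', 'GGA', 'CAA']
```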
VIRUS GENOME
Viruses are not living organisms; they lack the characteristics that living organisms have. Living organisms have
definite cells that contain organelles and a metabolic system, and they can provide themselves with energy.
Viruses have no such ability; however, they can replicate, but only inside living cells, and they can mutate and
adapt to their environment. Since a virus is not a living organism, it is defined as a small collection of genetic
code, either DNA or RNA, surrounded by a protein coat called a capsid. The genome together with its capsid is
called the nucleocapsid. Many viruses, but not all of them, also have a membrane called the envelope
surrounding the capsid. The virus envelope is usually acquired from the nuclear or cell membrane of the
infected host cell (Figure 1.45) [33].
Viruses are classified using the Baltimore classification system, which depends on the combination of
their nucleic acid (DNA or RNA), strandedness (single-stranded or double-stranded), sense (positive-sense
or negative-sense), and method of replication. The virus genome can be double-stranded DNA (dsDNA),
single-stranded DNA (ssDNA), double-stranded RNA (dsRNA), or single-stranded RNA (ssRNA). The latter
can be either positive-sense single-stranded RNA (+ssRNA) or negative-sense single-stranded RNA (-ssRNA).
An RNA strand is positive-sense if its own sequence is translated or translatable into protein, and it is
negative-sense if the complementary RNA sequence, and not the original
strand itself, is translated or translatable into protein [53]. The above criteria are used to characterize virus
families such as Adenoviridae (dsDNA), Coronaviridae (ssRNA), Retroviridae (ssRNA), Parvoviridae
(ssDNA), Poxviridae (dsDNA), etc.
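The distinction between positive-sense and negative-sense RNA can be illustrated by computing the strand that would be translated. This is a minimal Python sketch with hypothetical names and a made-up six-nucleotide genome; it is not taken from any virology library.

```python
# Base-pairing rules for RNA.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def complementary_rna(rna):
    """Return the complementary RNA strand, written 5' to 3'."""
    return rna.upper().translate(RNA_COMPLEMENT)[::-1]


def translatable_strand(genome_rna, sense):
    """Return the RNA sequence that is read by the ribosome.

    sense is '+' for a positive-sense genome and '-' for a negative-sense one.
    """
    if sense == "+":
        return genome_rna.upper()
    if sense == "-":
        return complementary_rna(genome_rna)
    raise ValueError("sense must be '+' or '-'")


# Hypothetical genome fragments, for illustration only.
print(translatable_strand("AUGUAA", "+"))  # AUGUAA (used directly)
print(translatable_strand("UUACAU", "-"))  # AUGUAA (complement is used)
```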
Viruses proliferate by infecting cells and hijacking their protein synthesis machinery to replicate, producing
multiple virions that are ready to infect new cells. A virion is a complete virus particle with all virus components.
During infection, a virion first attaches to receptors on the membrane of a living cell. The receptors can be
proteins, carbohydrates, or lipids. Once a virus is attached to a receptor, it can enter the cell by one of two
routes: endocytosis or non-endocytosis. In the endocytic route, the virus is taken up in a vesicle, commonly one
coated with the protein clathrin, and transported into the cell. In the non-endocytic route, the virus crosses the
plasma membrane at neutral pH. Some viruses use both entry routes. Viruses such as retroviruses can also enter
cells through cell-to-cell transmission [54].
Once the virus enters the host cell, it starts to replicate. The mode of replication depends on the type of viral
genome (dsDNA, ssDNA, dsRNA, +ssRNA, -ssRNA, etc.). Transcription and translation are similar to the processes
discussed above; viruses use their own genomes as templates, but they exploit the protein synthesis machinery of
the host cell to synthesize viral enzymes and capsid proteins and to assemble new viral genomes. DNA
viruses usually use host cell proteins and enzymes to make additional DNA, which is transcribed into mRNA that is
then translated into proteins. RNA viruses use the RNA strand as the template for mRNA, which is translated into
proteins, and for the viral genomic RNA used in assembling new virions. The RNA genomes of retroviruses, such as
HIV, must be reverse-transcribed into DNA, which is then incorporated into the host
cell genome. The process of reverse transcription is mediated by an enzyme known as reverse transcriptase.
This enzyme is also used in the laboratory to convert RNA into DNA, called complementary DNA (cDNA) [55].
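At the sequence level, reverse transcription amounts to writing the DNA strand that is complementary to an RNA template. The Python sketch below shows only this base-pairing step; the function name reverse_transcribe and the example sequence are illustrative assumptions, and the sketch does not model the enzyme itself.

```python
# Each RNA base pairs with a DNA base: A-T, C-G, G-C, U-A.
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")


def reverse_transcribe(rna):
    """Return the first-strand cDNA for an RNA template, written 5' to 3'."""
    rna = rna.upper()
    invalid = set(rna) - set("ACGU")
    if invalid:
        raise ValueError(f"not an RNA sequence, unexpected symbols: {sorted(invalid)}")
    # Complement each base, then reverse so the result reads 5' to 3'.
    return rna.translate(RNA_TO_CDNA)[::-1]


print(reverse_transcribe("AUGGCAUAA"))  # TTATGCCAT
```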
Following synthesis of viral proteins and viral genomes, the genomes are packaged in capsid
protein, forming new virions, which are then released from the host cell. The number of virions released from
a single infected cell is called the viral burst size. The virions can infect adjacent cells and repeat the
replication cycle. Some viruses are released after the death of the host cell, and some leave the infected cell by
passing through the cell membrane without killing it.
Many viruses cause diseases; some of these diseases are severe, highly contagious, and deadly. In past decades,
the world has faced several viral disease outbreaks, such as those caused by the Ebola, SARS, and Zika viruses,
which have had a massive death toll and impact on economies. Recently, the world has been struck by the outbreak
of COVID-19, which is caused by a coronavirus called SARS-CoV-2. At the time of writing, the world is still trying
to control its spread and to reduce its impact.
The COVID-19 infection starts with the attachment stage, in which the viral spike protein binds to
receptors on the cell membrane. The coronavirus uses the non-endocytic route to enter the host cell. The viral
RNA is translated into large replicase polyproteins (pp1a and pp1ab), which are subsequently cleaved into viral
nonstructural proteins (NSPs). Full-length negative-sense single-stranded RNA (-ssRNA) copies of the coronavirus
genome are produced by the replicase enzyme using the full +ssRNA virus genome as a template strand. The
spike protein, envelope protein, membrane protein, and nucleocapsid protein are translated from subgenomic
mRNAs and are used in the assembly of new virions in the Golgi body and endoplasmic reticulum of the host cell.
Once virions have been assembled, they are transported in vesicles to the extracellular space, where they are
released to infect new cells [56].
REFERENCES
1. Cochrane, G., et al., The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res, 2011. 39
(Database issue): pp. D15–18.
2. Hetzer, M. and G. Cavalli, Eukaryotic cells. Curr Opin Cell Biol, 2011. 23(3): pp. 255–257.
3. Souza, W., Prokaryotic cells: structural organisation of the cytoskeleton and organelles. Mem Inst Oswaldo Cruz,
2012. 107(3): pp. 283–293.
4. Benbow, R.M., Chromosome structures. Sci Prog, 1992. 76(301-302 Pt 3–4): pp. 425–450.
5. Schreck, R.R. and C.M. Disteche, Chromosome banding techniques. Curr Protoc Hum Genet, 2001.
Chapter 4: Unit 4.2.
6. Bolcun-Filas, E. and M.A. Handel, Meiosis: the chromosomal foundation of reproduction. Biol Reprod, 2018.
99(1): pp. 112–126.
7. Garbers, D.L., Molecular basis of fertilization. Annu Rev Biochem, 1989. 58: pp. 719–742.
8. Georgadaki, K., et al., The molecular basis of fertilization (Review). Int J Mol Med, 2016. 38(4): pp. 979–986.
9. Vaillancourt, C. and J. Lafond, Human embryogenesis: overview. Methods Mol Biol, 2009. 550:
pp. 3–7.
10. Margolin, W., FtsZ and the division of prokaryotic cells and organelles. Nat Rev Mol Cell Biol, 2005. 6(11): pp. 862–871.
11. Watson, J.D. and F.H. Crick, Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Clin
Orthop Relat Res, 2007. 462: pp. 3–5.
12. Green, E.D., J.D. Watson, and F.S. Collins, Human Genome Project: Twenty-five years of big biology. Nature, 2015.
526(7571): pp. 29–31.
13. Salzberg, S.L., Open questions: How many genes do we have? BMC Biol, 2018. 16(1): p. 94.
14. Bruford, E.A., et al., Guidelines for human gene nomenclature. Nat Genet, 2020. 52(8): pp. 754–758.
15. Wojcik, F., [Guidelines for human gene nomenclature]. Ann Biol Clin (Paris), 2002. 60(3): pp. 347–350.
16. Stringer, C., The origin and evolution of Homo sapiens. Philos Trans R Soc Lond B Biol Sci, 2016. 371(1698).
17. Spieth, J. and D. Lawson, Overview of gene structure. WormBook, 2006: pp. 1–10.
18. Spieth, J., et al., Overview of gene structure in C. elegans. WormBook, 2014: pp. 1–18.
19. Haberle, V. and A. Stark, Eukaryotic core promoters and the functional basis of transcription initiation. Nat Rev Mol
Cell Biol, 2018. 19(10): pp. 621–637.
20. IUPAC-IUB Commission on Biochemical Nomenclature (CBN). Abbreviations and symbols for nucleic acids,
polynucleotides and their constituents. Recommendations 1970. Eur J Biochem, 1970. 15(2): pp. 203–208.
21. Shu, J.J., A new integrated symmetrical table for genetic codes. Biosystems, 2017. 151: pp. 21–26.
22. Pennisi, E., Genomics. ENCODE project writes eulogy for junk DNA. Science, 2012. 337(6099): pp. 1159, 1161.
23. Fan, H. and J.Y. Chu, A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics, 2007.
5(1): pp. 7–14.
24. Ponicsan, S.L., J.F. Kugel, and J.A. Goodrich, Genomic gems: SINE RNAs regulate mRNA production. Curr Opin
Genet Dev, 2010. 20(2): pp. 149–155.
25. Nelson, P.N., et al., Human endogenous retroviruses: transposable elements with potential? Clin Exp Immunol,
2004. 138(1): pp. 1–9.
26. Cordaux, R. and M.A. Batzer, The impact of retrotransposons on human genome evolution. Nat Rev Genet, 2009.
10(10): pp. 691–703.
27. Bourque, G., et al., Ten things you should know about transposable elements. Genome Biol, 2018. 19(1): p. 199.
28. Zheng, D., et al., Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and
evolution. Genome Res, 2007. 17(6): pp. 839–851.
29. Torrents, D., et al., A genome-wide survey of human pseudogenes. Genome Res, 2003. 13(12): pp. 2559–2567.
30. Bischof, J.M., et al., Genome-wide identification of pseudogenes capable of disease-causing gene conversion. Hum
Mutat, 2006. 27(6): pp. 545–552.
31. Turner, K.J., V. Vasu, and D.K. Griffin, Telomere Biology and Human Phenotype. Cells, 2019. 8(1).
32. Aguado, J., F. d’Adda di Fagagna, and E. Wolvetang, Telomere transcription in ageing. Ageing Res Rev, 2020.
62: pp. 101–115.
33. Lodish, H., A. Berk, S.L. Zipursky, et al., Molecular Cell Biology. 4th ed. 2000. New York: Freeman.
34. Yusupov, M.M., et al., Crystal structure of the ribosome at 5.5 A resolution. Science, 2001. 292(5518): pp. 883–896.
35. Torres-Machorro, A.L., et al., Ribosomal RNA genes in eukaryotic microorganisms: witnesses of phylogeny? FEMS
Microbiol Rev, 2010. 34(1): pp. 59–86.
36. Long, E.O. and I.B. Dawid, Repeated genes in eukaryotes. Annu Rev Biochem, 1980. 49: pp. 727–764.
37. Sollner-Webb, B. and E.B. Mougey, News from the nucleolus: rRNA gene expression. Trends Biochem Sci, 1991.
16(2): pp. 58–62.
38. Rich, A., A Hybrid Helix Containing Both Deoxyribose and Ribose Polynucleotides and Its Relation to the Transfer
of Information between the Nucleic Acids. Proc Natl Acad Sci USA, 1960. 46(8): pp. 1044–1053.
39. Bartel, D.P., Metazoan MicroRNAs. Cell, 2018. 173(1): pp. 20–51.
40. Bartel, D.P., MicroRNAs: target recognition and regulatory functions. Cell, 2009. 136(2): pp. 215–233.
41. Watson, J.D., Molecular Biology of the Gene. 7th ed. 2013. Pearson.
42. Eisenberg, E. and E.Y. Levanon, Human housekeeping genes are compact. Trends Genet, 2003. 19(7): pp. 362–365.
43. Eisenberg, E. and E.Y. Levanon, Human housekeeping genes, revisited. Trends Genet, 2013. 29(10): pp. 569–574.
44. Alberts, B., A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, Molecular Biology of the Cell. 6th ed. 2008.
New York: Garland Science.
45. Pan, Q., et al., Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput
sequencing. Nat Genet, 2008. 40(12): pp. 1413–1415.
46. Cooper, G., The Cell: A Molecular Approach. 8th ed. 2018: Oxford: Oxford University Press.
47. Nelson, D.L., Lehninger Principles of Biochemistry. 7th ed. 2017: New York: W.H. Freeman.
48. Zwanzig, R., A. Szabo, and B. Bagchi, Levinthal’s paradox. Proc Natl Acad Sci USA, 1992. 89(1):
pp. 20–22.
49. Pauling, L., R.B. Corey, and H.R. Branson, The structure of proteins; two hydrogen-bonded helical configurations of
the polypeptide chain. Proc Natl Acad Sci USA, 1951. 37(4): pp. 205–211.
50. Adams, P.D., et al., Announcing mandatory submission of PDBx/mmCIF format files for crystallographic depositions
to the Protein Data Bank (PDB). Acta Crystallogr D Struct Biol, 2019. 75(Pt 4): pp. 451–454.
51. Berman, H.M., The Protein Data Bank: a historical perspective. Acta Crystallogr A, 2008. 64(Pt 1): pp. 88–95.
52. Brown, I.D., CIF (Crystallographic Information File): A Standard for Crystallographic Data Interchange. J Res Natl
Inst Stand Technol, 1996. 101(3): pp. 341–346.
53. Siegel, R.D. and C.G. Prober, Classification of Viruses. Principles and Practice of Pediatric Infectious Disease.
2008. 1001–1005. doi: 10.1016/B978-0-7020-3468-8.50207-8. Epub 2020 June 22.
54. Dimitrov, D.S., Virus entry: molecular mechanisms and biomedical applications. Nat Rev Microbiol, 2004.
2(2): pp. 109–122.
55. Burrell, C.J., C.R. Howard, and F.A. Murphy, Fenner and White’s Medical Virology. 5th ed. 2016: Cambridge,
MA: Academic Press.
56. Yesudhas, D., A. Srivastava, and M.M. Gromiha, COVID-19 outbreak: history, mechanism, transmission, structural
studies and therapeutics. Infection, 2021. 49(2): pp. 199–213.