Introduction A La Bioinformatique
INTRODUCTION TO
BIOINFORMATICS
Angshuman Bagchi
Alpha Science International Ltd.
Oxford, U.K.
Introduction to Bioinformatics
168 pgs.
Angshuman Bagchi
Department of Biochemistry and Biophysics
University of Kalyani
Nadia
Copyright © 2018
ALPHA SCIENCE INTERNATIONAL LTD.
7200 The Quorum, Oxford Business Park North
Garsington Road, Oxford OX4 2JZ, U.K.
www.alphasci.com
ISBN 978-1-78332-377-7
Preface
1. Introduction
1.1 History of Bioinformatics
1.2 Biological Databases
1.3 Biological Sequence Analysis
2. Biological Databases
2.1 Introduction
2.2 Some Important Terminologies and Definitions
2.3 Primary Sequence Databases
2.4 The Nucleotide Sequence Databases
2.5 Final Words
2.6 The Protein Sequence Databases
2.7 The Protein Structure Database
2.8 Combination Databases
Some Important Questions and Answers
3. Biological Sequence Alignment
3.1 Introduction
3.2 Some Important Terms
3.3 Biological Importance of Sequence Alignment
1 Introduction
The rapid progress in gene and protein identification and sequencing has
generated large volumes of information related to genomics,
transcriptomics, proteomics and metabolomics. These data need to be
stored in a form that can be easily accessed, updated and retrieved from
all over the world.
Alignments of more than two protein or DNA sequences are very useful
for studying homologous groups, since they reveal the structure-function
relationships and the evolution of protein or gene families. Multiple
alignment is more sensitive than pairwise alignment for detecting
homologs and for finding conserved domains, motifs or consensus regions.
The most commonly used algorithm for multiple sequence alignment is
derived from Feng and Doolittle's (1987, 1990) progressive alignment
method. The basic strategy of this method is to calculate pairwise
alignment scores between all sequences, align the two most closely
related sequences first, and then add the remaining sequences
progressively to the alignment. The most popular web-based multiple
sequence alignment tool is the ClustalW program. To generate a guide
tree, this program mostly uses a distance matrix rather than a similarity
matrix. The two main features of a guide tree are its branching order
(topology) and its branch lengths, which are proportional to evolutionary
distance. The two main methods of tree construction are:
(a) unweighted pair group method with arithmetic averages (UPGMA)
and
(b) neighbour-joining method.
The multiple sequence alignment output arranges the sequences in the
order given by the guide tree, i.e., the two most closely related
sequences form a pairwise alignment and the other sequences are then
added one by one. The newer web-based multiple sequence alignment tool
is Clustal Omega, which uses seeded guide trees and HMM profile-profile
techniques to generate alignments.
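The guide-tree step can be illustrated with a minimal UPGMA sketch. This is a generic textbook version written for illustration, not the code used by ClustalW; the function name `upgma`, the nested-tuple tree format and the toy distance matrix are assumptions made here.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster sequences by UPGMA.

    names: list of sequence labels.
    dist:  symmetric matrix (list of lists) of pairwise distances.
    Returns (tree, merges): tree is a nested tuple of labels and merges
    lists (left_subtree, right_subtree, node_height) in merge order.
    """
    # Each active cluster id maps to (subtree, number_of_leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by a frozenset of two ids.
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair; ties are broken by cluster id.
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        height = d[pair] / 2.0
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merges.append((tree_a, tree_b, height))
        # Distance from the merged cluster to every other cluster is the
        # size-weighted arithmetic average of the two old distances.
        new_d = {c: (n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))])
                    / (n_a + n_b)
                 for c in clusters}
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        for c, v in new_d.items():
            d[frozenset((next_id, c))] = v
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, merges


# Hypothetical toy distances between four sequences A-D.
names = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0]]
tree, merges = upgma(names, dist)
```

In a progressive alignment, the merge order recorded in `merges` would decide which sequences or sub-alignments are aligned next: here A and B first, then C, then D.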
A star-like tree is generated first. Pairwise comparisons are then made
to identify the two most closely related sequences (the neighbors), which
are joined, and the process is repeated until the topology of the full
tree is complete.
1.3.3.2 Maximum Likelihood (ML) Method
again searched using blastp and new functional domains are found. The
relative mutational probabilities of the amino acid residues differ,
partly as a consequence of the genetic code. Some amino acid residues,
such as asparagine and serine, undergo substitutions very frequently,
while other residues (notably tryptophan and cysteine) are very rarely
mutated. Dayhoff et al. calculated the relative mutabilities of the amino
acid residues, which describe how often each amino acid is likely to
change over a short evolutionary period. To calculate the relative
mutability they divided the number of times each amino acid was observed
to mutate by the overall frequency of occurrence of that amino acid. The
less mutable residues probably have important structural and functional
roles in proteins, so that replacing them would cause significant
changes. The most common substitutions are Glu for Asp (both acidic), Ser
for Ala and Thr (Ser and Thr are both hydroxylated), and Ile for Val
(both hydrophobic and of similar size). The relative mutabilities
therefore depend on the physico-chemical properties of the amino acids.
The genetic code offers a further explanation: common amino acid
substitutions tend to require only a single nucleotide change, and the
least mutable residues are coded by only one or two codons. The low
mutability of these amino acids suggests that their substitution is not
tolerated by natural selection.
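The relative mutability calculation described above (observed changes divided by occurrence) can be sketched as follows. The counts are hypothetical toy numbers chosen only to show the arithmetic; they are not Dayhoff's published values, and the function name is an assumption.

```python
def relative_mutability(times_mutated, times_observed):
    """Relative mutability per residue: observed changes / occurrences."""
    return {aa: times_mutated[aa] / times_observed[aa] for aa in times_mutated}


# Hypothetical toy counts, NOT Dayhoff's data.
times_mutated = {"N": 40, "S": 36, "W": 2, "C": 6}
times_observed = {"N": 400, "S": 400, "W": 200, "C": 300}

mutability = relative_mutability(times_mutated, times_observed)
# Residues ordered from most to least mutable.
ranked = sorted(mutability, key=mutability.get, reverse=True)
```

With these made-up counts asparagine and serine rank as the most mutable residues and tryptophan as the least, which mirrors the qualitative trend stated in the text.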
Secondary Structure Prediction: Secondary structure prediction
techniques aim to predict the local secondary structures of proteins from
their sequences. The first-generation methods of the 1960s and 1970s were
based on the probability of finding a particular amino acid in a
particular secondary structure; this approach is referred to as the
Chou-Fasman method. The second-generation methods, used until the early
1990s, also considered the local environment of adjacent residues,
typically in segments of 3-51 residues. Modern protein secondary
structure prediction methods exploit the evolutionary information
contained in multiple sequence alignments, for example by recognizing
patterns of hydrophobicity, conservation and sequence edge effects.
Combined with databases of known protein structures and modern machine
learning methods, such as neural networks, multiple linear regression,
k-nearest neighbours and support vector machines, this has raised the
accuracy of secondary-structure prediction to about 80% for globular
proteins. Further boost
bridge the gap between the available sequence and structure information
by providing reliable and accurate models.
There are three major methods for predicting the three-dimensional
structure of a protein:
1. Homology modelling or comparative modelling
2. Threading / fold recognition
3. Ab initio structure prediction
The details of these methods are presented in Chapter 4.
Model Refinement/Optimization: The models of the biomolecules
obtained from the aforementioned techniques may contain many errors.
It is therefore strongly recommended to check the stereo-chemical
quality of the models. Methods that check and correct these errors are
called model optimization. Model optimization is applied to the whole
protein, not to isolated loop regions. Energy minimization is done while
restraining the atom positions and/or for only a few hundred steps.
Molecular dynamics simulation removes large errors, but imperfect
force fields may introduce atom clashes or small errors.
Validation of Models: The main sources of error in homology
modelling are:
1. Poor alignment: The target-template sequence identity is very low.
When the target-template sequence identity is 90% or above, the
model accuracy is comparable with crystal structures. In the
50-90% identity range there is an RMS deviation of around 1.5 Å
between the structures of the target and the template, with
considerably larger local errors, and below 25% sequence identity
the generated model may have large errors.
2. Errors in template: If the template itself contains errors, the
quality of the generated model will be very low.
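As a rough illustration of the identity ranges quoted above, the sketch below computes percent identity from a pairwise alignment and maps it to the accuracy statement given in the text. The identity definition used here (matches divided by the number of columns with no gap) is one common convention and is an assumption, as are the function names; the 25-50% range is left unclassified because the text does not describe it.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over alignment columns that contain no gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def expected_model_quality(identity):
    """Map identity (%) to the qualitative ranges stated in the text."""
    if identity >= 90:
        return "comparable with crystal structures"
    if identity >= 50:
        return "about 1.5 A RMS deviation, larger local errors"
    if identity < 25:
        return "large errors likely"
    return "range not characterised in the text"


# Hypothetical toy alignment of a target against a template.
identity = percent_identity("ACDE-", "ACDFG")
quality = expected_model_quality(identity)
```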
The structure of the modelled protein must be validated, for example
through the Structure Analysis and Verification Server (SAVES), which
contains software programs for checking and validating protein structures
during and after model refinement. The following programs are generally
used for checking model quality.
Procheck: This software tool checks the stereo-chemical quality of
protein structures by analyzing overall and residue-by-residue geometry.
The plots it produces include: Ramachandran plots, both for the protein
as a whole and for each type of amino acid; χ1-χ2 plots (side chain
torsion angles) for each amino acid type; main-chain bond lengths and
bond angles; a secondary structure plot; deviations from planarity of
planar side chains; and so on. Another important check is for bad
contacts, i.e., pairs of non-bonded atoms whose separation is smaller
than the sum of their van der Waals radii.
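The bad-contact criterion can be sketched in a few lines. This is only an illustration of the distance test, not Procheck's implementation: the radii are Bondi van der Waals radii in Å, the atom tuples and the `bonded` set are hypothetical inputs, and real validation programs use their own radii, tolerances and exclusion rules.

```python
import math
from itertools import combinations

# Bondi van der Waals radii in angstroms (an assumption for this sketch).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def bad_contacts(atoms, bonded=frozenset()):
    """Return non-bonded atom pairs closer than the sum of their vdW radii.

    atoms:  list of (name, element, (x, y, z)) tuples.
    bonded: set of frozenset({name_i, name_j}) pairs to skip.
    """
    clashes = []
    for (n1, e1, p1), (n2, e2, p2) in combinations(atoms, 2):
        if frozenset((n1, n2)) in bonded:
            continue  # covalently bonded atoms are expected to be close
        distance = math.dist(p1, p2)
        if distance < VDW_RADII[e1] + VDW_RADII[e2]:
            clashes.append((n1, n2, round(distance, 2)))
    return clashes


# Hypothetical toy coordinates in angstroms.
atoms = [
    ("CA", "C", (0.0, 0.0, 0.0)),
    ("CB", "C", (1.5, 0.0, 0.0)),
    ("O1", "O", (0.0, 3.0, 0.0)),
    ("N2", "N", (10.0, 0.0, 0.0)),
]
bonded = {frozenset(("CA", "CB"))}
clashes = bad_contacts(atoms, bonded)
```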
What_check: The tool provides an enormous number of checks
and detailed overall summary of the stereo-chemical parameters of the
residues in the model. Directional Atomic Contact Analysis implemented
in this program is used to analyze non-bonded contacts.
Verify3D: The tool examines the compatibility of an atomic model
(3D) with its own amino acid sequence (1D) by assigning a structural
class based on its location and environment (alpha, beta, loop, polar,
non-polar etc) and comparing the results to good structures.
ERRAT: The tool analyzes the statistics of non-bonded atom-atom
interactions with respect to a database of reliable high-resolution
structures.
Prove: The tool compares atomic volumes against a set of pre-
calculated standard values where atoms are treated like hard spheres
and calculates a statistical Z-score deviation for the model from highly
resolved and refined PDB-deposited structures.
Rampage: The tool reports the stereo-chemical quality of a protein
structure as a Ramachandran plot, where residues having proper φ and
ψ main-chain torsion angles are clustered in the allowed regions and the
percentage of residues in the disallowed regions is given.
In this chapter a brief overview of the different aspects of
Bioinformatics is provided. The details of these topics are presented in
the subsequent chapters.
2 Biological Databases
2.1 INTRODUCTION
The basic aim of this chapter is to deal with various aspects of
Biological Databases, emphasizing the distinction between data types,
data organization and usability of the data. The chapter focuses on
sequence and structural databases. Necessary definitions are also
provided, together with the web addresses of the biological databases.
Many journals have made it mandatory for authors to submit newly
determined nucleotide sequence information to the EMBL, GenBank or DDBJ
database prior to publication, in order to ensure its availability to
scientists. The data can be submitted via the web tool
http://www.ebi.ac.uk/embl/Submission/webin.html.
Database entries produced at the sequencing site can be deposited and
updated directly by the submitters using FTP or email. Groups producing
large volumes of genome sequence data over an extended period of time
are advised to contact the database at datasubs@ebi.ac.uk.
Sequence manipulations
The EMBL Nucleotide Sequence Database can be accessed via the EBI
SRS server at http://srs.ebi.ac.uk/. In SRS, the data are available in the
following libraries:
(i) EMBL: the database in its entirety by means of a virtual library
comprising EMBLRELEASE, EMBLNEW, EMBLTPA and
EMBLWGS;
(ii) EMBLRELEASE: library containing the latest official release of
the EMBL Nucleotide Sequence Database;
(iii) EMBLNEW: library containing updated and new entries created
since the last official release;
(iv) EMBLTPA: library containing TPA entries;
(v) EMBLWGS: library containing WGS entries;
(vi) EMBLCON: library containing CON entries.
Sequence searching
For Phylogenetics:
The PIR database is also linked with a literature database to search for
suitable bibliographic references for the proteins and their annotations.
As sub-databases, PIR stores some other important information in terms of:
NREF: It is a non-redundant reference database. This database is
constantly cross-referenced with other existing databases like
Swiss-Prot, PDB etc. The NREF database can be searched for sequences
using various sequence analysis tools like BLAST.
iProClass: This is an integrated database of protein family, function,
structure information, pathways, interactions, genes and gene ontology
information and taxonomic results.
iPTMnet: This is a resource that provides information about the
various types of post-translational modifications in proteins. This resource
covers the Systems Biology aspects of proteins.
iProLINK (integrated Protein Literature, INformation and
Knowledge): This is a resource that links the literature references of
proteins with the sequence information available in PIR. The resource is
aimed at providing comprehensive knowledge of the proteins.
PIRSF: This is a tool that is meant to gather information from other
resources, viz., UniProtKB in a comprehensive manner. This tool is also
used to derive evolutionary relationships from the amino acid sequences
present in the UniProtKB. One of the interesting features of the tool is to
analyze the protein relationships using the entire protein. This is unique
in the sense that other such resources perform the task using only the
specific domain information in the protein. The tool helps in proper
annotations of biological functions of proteins for which specific domain
information is not available.
PRO: This tool is called the protein ontology tool. It is used
specifically for analyzing the ontological details of the available
amino acid sequences of proteins. The ontological information covers
specific ontology terms such as protein complex clusters, protein
orthologous isoforms etc. The tool also provides information about the
protein product in a taxon-specific and a taxon-neutral manner. There
are sub-ontologies of PRO. They are
Prefix  The entry having this accession code signifies
NC_     a complete genomic molecule, usually a reference genome assembly.
        The reference genome is selected as a representative from the set
        of best genomic sequences available
NG_     an incomplete genomic molecule
NT_     genomic contigs obtained from cloning experiments. A sequence
        contig is a continuous stretch of sequence bounded by a gap or a
        run of 10 or more identical nucleotides
NM_     data pertaining to mRNA molecules
NR_     data pertaining to non-coding RNA molecules
XM_     data pertaining to predicted protein-coding mRNA molecules
XR_     data pertaining to predicted non-coding RNA molecules
AP_     data pertaining to proteins
NP_     data pertaining to proteins obtained from genomes represented by
        AC_ or NC_
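The accession prefixes listed above lend themselves to a simple lookup. The sketch below only encodes the prefixes named in this section; the function name and the placeholder accession strings are hypothetical.

```python
# Prefix meanings as listed in this section (other prefixes exist).
ACCESSION_PREFIXES = {
    "NC_": "complete genomic molecule (reference assembly)",
    "NG_": "incomplete genomic molecule",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "XM_": "predicted protein-coding mRNA",
    "XR_": "predicted non-coding RNA",
    "AP_": "protein",
    "NP_": "protein from a genome represented by AC_ or NC_",
}


def describe_accession(accession):
    """Return the molecule type implied by the first three characters."""
    prefix = accession[:3].upper()
    return ACCESSION_PREFIXES.get(prefix, "prefix not listed in this section")


# Placeholder accession numbers used only to exercise the lookup.
kind_a = describe_accession("NM_000000.0")
kind_b = describe_accession("ZZ_000000.0")
```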
yeast, plants, flies, other animals and humans. The database not only
stores static information but also contains several tools to process and
analyze the data. In general, a PDB entry contains the sequence and
structural information together. It also provides the experimental
details of the structure determination. The amino acid sequences in a PDB
entry are linked to their corresponding structural details. A PDB entry
contains the necessary annotations, which provide information on the
structure and function of the protein. A PDB entry is linked to various
other structural databases like CATH and SCOP, as well as to the
literature database PubMed. The tools available in PDB help to find
proteins having sequences and structures similar to those of an existing
protein. The PDB data can be freely downloaded for analysis purposes.
There are some other databases which provide information about all
types of biological systems. Some of those databases are Gene Ontology
Database (GO Database), DAVID (the Database for Annotation,
Visualization and Integrated Discovery), Pubmed,
2.8.1 Gene Ontology Database (GO Database)
(http://www.geneontology.org/page/go-database)
It is a major computational initiative to collect and present the different
features of genes and gene products from all species. The main aims
behind the development of the database are manifold.
1. To maintain and develop a controlled vocabulary of various
features of genes and gene products
2.8.2 DAVID (the Database for Annotation, Visualization and
Integrated Discovery) (https://david.ncifcrf.gov/home.jsp)
DAVID (the Database for Annotation, Visualization and Integrated
Discovery) is a repository developed by the Laboratory of
Immunopathogenesis and Bioinformatics (LIB). The tools present in
the DAVID Bioinformatics Resources help the user to functionally
annotate large lists of genes derived from genomic studies,
e.g. microarray and proteomics studies. The DAVID Bioinformatics
Repository comprises the DAVID Knowledgebase and five other
integrated, web-based functional annotation tools:
• DAVID Gene Functional Classification Tool
• DAVID Functional Annotation Tool
• DAVID Gene ID Conversion Tool
• DAVID Gene Name Viewer
• DAVID NIAID Pathogen Genome Browser.
FINAL WORDS
Biological sequences are not limited to nucleic acids and proteins;
there are carbohydrates as well. Nonetheless, the number of new
sequences is increasing day by day. Databases are used to store this
large volume of sequence information. The databases are constantly
being upgraded, and the different databases are cross-referenced with
one another. However, not all databases have proper annotations of the
biological data. The authenticity of the data should therefore be
checked before experimentation.
A few other important biological databases are mentioned in the
following sections:
Carbohydrate Structure Databases
(d) Entrez
ANSWERS
3.1 INTRODUCTION
[Figure: sequence GATTACA (figure not recovered)]
FINAL WORDS
There are many methods to determine the biological sequence alignments.
All the methods are fairly accurate. The approximations used in the
algorithms are also quite good. The new methods that are being developed
are extensions of the previous methods. Not only are the sequences of DNA
and proteins analyzed, but RNA sequences, such as expressed sequence
tags and full-length mRNAs, are also aligned to sequenced genomes in
order to find the locations of genes. Such alignments also provide
information about alternative splicing and RNA editing. Biological
sequence alignment techniques are also important parts of genome
assembly. In genome assembly techniques, the sequences are aligned to
the whole genomes to find overlapping portions. These overlaps provide
information about contigs, which are long contiguous stretches of
sequence. The analysis of nucleic acid sequences also provides
information about Single Nucleotide Polymorphisms (SNPs).
The techniques for the analysis of biological sequences have found
applications in non-biological fields as well. The most important of
these are in natural language processing and in the social sciences.
In these fields, the Needleman-Wunsch algorithm is usually used to
identify the optimal matching. Word selection methods in natural
language generation have borrowed multiple sequence alignment techniques
from bioinformatics to produce linguistic versions of computer-generated
mathematical proofs. In the field of historical and comparative
linguistics, sequence alignment tools have been used to partially
automate the comparative method by which linguists traditionally
reconstruct languages. Interestingly, business and marketing
professionals also use multiple sequence alignment techniques to analyze
a series of purchases over time.
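Since the Needleman-Wunsch algorithm is named here as the usual tool for optimal matching, a minimal global-alignment sketch may help. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions for illustration; practical tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b by dynamic programming.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("GATTACA", "GCATGCU")
```

The same function works on any pair of sequences of symbols, which is why it transfers readily to words, event histories or purchase sequences.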
ANSWERS
4 Molecular Modelling
4.1 INTRODUCTION
There are 20 different amino acids. They are classified based on the
physico-chemical characteristics of their side chains. The various
physicochemical properties of side chains are:
Size, shape, hydrophobicity, charge, hydrogen bonding
The general properties of amino acids:
Glycine (Gly) (G): The amino acid G has the following characteristics:
• It is the smallest amino acid
• Its side chain is a single hydrogen atom, so it effectively has no
side chain
• It is highly flexible and is commonly observed in turns
• It often appears as an outlier in the Ramachandran plot
• It is ambivalent. It can be located inside or outside of the protein
structure; buried or solvent exposed
Alanine (Ala) (A): The amino acid A has the following characteristics:
• It is small
• It is ambivalent. It can be located inside or outside of the protein
structure; buried or solvent exposed
Valine (Val) (V) : The amino acid V has the following characteristics:
• It has a large side chain
• It is hydrophobic
• It remains usually buried in the protein core
Leucine (Leu) (L): The amino acid L has the following characteristics:
• It has a large side chain with an additional –CH2 group attached
to the side chain as compared to V. It is larger than V.
• It is hydrophobic
• It remains usually buried in the protein core
Isoleucine (Ile) (I): The amino acid I has the following characteristics:
Serine (Ser) (S): The amino acid S has the following characteristics:
• The amino acid contains a -OH group in the side chain
• The side chain of the amino acid remains uncharged at
physiological pH
Threonine (Thr) (T): The amino acid T has the following
characteristics:
• The amino acid has a small, hydrophilic side chain
• The amino acid takes part in post-translational modifications
• This amino acid is found to be present in many enzymes
• This amino acid is also involved in metal ion binding
Tyrosine (Tyr) (Y): The amino acid Y has the following characteristics:
• This amino acid has an aromatic side chain with a hydrophilic
–OH group attached to the aromatic ring
• This amino acid helps in identification of proteins by UV-Vis
spectroscopy
• It may remain buried in the protein core or exposed to the solvents
• This amino acid is found to be present in many enzymes
• The amino acid takes part in post-translational modifications
Tryptophan (Trp) (W): The amino acid W has the following
characteristics:
• It has the largest side chain
• The side chain of this amino acid is aromatic
• This amino acid helps in identification of proteins by UV-Vis
spectroscopy
• The nitrogen atom present in the side chain of the amino acid can
form hydrogen bonds
• This amino acid is found to be present in many enzymes
• The nitrogen atom of the side chain often remains exposed to
surface
Patch PI = sum of the PIs of the amino acid residues in the patch /
number of amino acid residues in the patch
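The patch PI formula above is a simple arithmetic mean, as the following sketch shows. The per-residue PI values are hypothetical numbers used only to demonstrate the calculation.

```python
def patch_pi(residue_pis):
    """Patch PI = sum of residue PIs / number of residues in the patch."""
    if not residue_pis:
        raise ValueError("a patch must contain at least one residue")
    return sum(residue_pis) / len(residue_pis)


# Hypothetical per-residue PI values for a three-residue surface patch.
value = patch_pi([0.5, 1.5, 1.0])
```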
A few aspects of protein-ligand interactions
A ligand is a substance that binds to a biomolecule, most often a
protein, to serve some specific purpose. In protein-ligand binding, the
ligand is usually a signaling molecule that binds to a specific site on a
target protein. The binding occurs through intermolecular forces such as
ionic bonds, hydrogen bonds and van der Waals forces, and the
association between the ligand and its receptor protein is usually
reversible. Irreversible covalent bonding between a ligand and its
target molecule is rare in biological systems.
A ligand binding to its receptor protein alters its chemical conformation
i.e., the three-dimensional shape of the receptor protein. It is known that
the conformational state of a receptor protein determines its functional
state. In biological systems, the ligands can be of different types. They
include substrates, inhibitors, activators, and neurotransmitters. The
tendency or strength of ligand binding to its receptor protein is called
affinity. The binding affinity of ligands is determined not only by direct
interactions, but also by solvent effects that are known to play a dominant
indirect role in driving non-covalent receptor-ligand bindings in solutions.
The interactions of most ligands with the binding sites of their
corresponding receptors can be characterized in terms of binding affinity.
In general, high-affinity ligand binding results from stronger
intermolecular forces between the ligand and its receptor, whereas
low-affinity ligand binding involves weaker intermolecular forces.
High-affinity binding also involves a longer residence time for the
ligand at its receptor binding site, whereas low-affinity binding leads
to a comparatively shorter residence time. High-affinity binding of ligands to
their receptors is often physiologically important and in these cases some
of the binding energy can be used to cause a conformational change in
the receptor site. This results in some altered behavior of an associated
ion channel or enzyme.
A ligand is called the agonist for that particular receptor if the ligand
can bind to the receptor, alter the function of the receptor, and trigger
a physiological response. The agonist for a specific receptor can be
characterized on the basis of how much physiological response can be
triggered and in terms of the concentration of the agonist that is required
to produce the physiological response. For agonists binding to high-
affinity ligand binding sites in receptors, a relatively low concentration
of a ligand is adequate to maximally occupy a ligand-binding site and
trigger a physiological response. On the other hand, low-affinity binding
implies that a relatively high concentration of a ligand is required before
the binding site in the receptor is maximally occupied and the maximum
physiological response to the ligand is achieved.
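The link between affinity and the concentration needed to occupy a binding site can be made concrete with the standard single-site equilibrium relation, occupancy = [L] / (Kd + [L]), where Kd is the dissociation constant. This relation is not given in the text and is used here as an assumption (one ligand, one independent site, equilibrium); the Kd values are hypothetical.

```python
def fractional_occupancy(ligand_conc, kd):
    """Fraction of sites occupied at equilibrium: [L] / (Kd + [L])."""
    return ligand_conc / (kd + ligand_conc)


def conc_for_occupancy(theta, kd):
    """Ligand concentration needed to reach fractional occupancy theta."""
    if not 0 <= theta < 1:
        raise ValueError("theta must be in [0, 1)")
    return kd * theta / (1 - theta)


# Hypothetical dissociation constants in mol/L.
KD_HIGH_AFFINITY = 1e-9   # tight binder
KD_LOW_AFFINITY = 1e-6    # weak binder

# Concentration needed for 90% occupancy in each case.
c_high = conc_for_occupancy(0.9, KD_HIGH_AFFINITY)
c_low = conc_for_occupancy(0.9, KD_LOW_AFFINITY)
```

Under these assumptions the weak binder needs a thousand-fold higher ligand concentration than the tight binder to reach the same occupancy, which is the qualitative point made above.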
On a similar note, an antagonistic ligand is one that binds to its
receptor but fails to stimulate it, and so does not bring about the
physiological response.
Ligands can be selective or non-selective. Selective ligands tend to
bind to a very limited range of receptor proteins, whereas non-selective
ligands bind to several types of receptor proteins. This is of great
pharmaceutical importance: drugs that are non-selective tend to have
more adverse effects, because they bind to several other receptors in
addition to the one generating the desired effect.
There is another class of ligands referred to as bivalent ligands. The
bivalent ligands consist of two connected molecules as ligands, and are
used in scientific research to detect receptor dimers and to investigate
their properties. Bivalent ligands are usually large molecules and they
tend not to be ‘drug-like’, limiting their applicability in clinical settings.
Molecular modelling techniques utilize all the basic definitions of
protein structures and come up with plausible models of single proteins,
protein-protein and protein-ligand complexes. In the next few pages the
basic principles of molecular modelling techniques will be described. As
mentioned earlier molecular modelling techniques depend on molecular
mechanics and quantum chemistry principles.
forces within the system. The molecular system’s potential energy (E) in
a given conformation is a sum of individual energy terms.
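\[ E = E_{\mathrm{covalent}} + E_{\mathrm{noncovalent}} \]
where the components of the covalent and non-covalent contributions
are given by the following summations:
\[ E_{\mathrm{covalent}} = E_{\mathrm{bond}} + E_{\mathrm{angle}} + E_{\mathrm{dihedral}} \]
\[ E_{\mathrm{noncovalent}} = E_{\mathrm{van\ der\ Waals}} + E_{\mathrm{electrostatic}} \]
One functional form of force field is
\[
\begin{aligned}
V(r) ={}& \sum_{\mathrm{bonds}} k_b (b - b_0)^2
        + \sum_{\mathrm{angles}} k_\theta (\theta - \theta_0)^2 \\
      & + \sum_{\mathrm{dihedrals}} k_\phi \bigl(1 + \cos(n\phi - \phi_0)\bigr)
        + \sum_{\mathrm{impropers}} k_\psi (\psi - \psi_0)^2 \\
      & + \sum_{\substack{\text{non-bonded} \\ \text{pairs } (i,j)}}
          4\varepsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}
          - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right]
        + \sum_{\substack{\text{non-bonded} \\ \text{pairs } (i,j)}}
          \frac{q_i q_j}{\varepsilon_D\, r_{ij}}
\end{aligned}
\]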
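Three of the terms in this functional form (harmonic bond stretching, Lennard-Jones and Coulomb) can be evaluated with a few lines of code. This is a sketch in arbitrary consistent units; the function names and the parameter values are illustrative assumptions and do not come from any particular force field.

```python
def bond_energy(k_b, b, b0):
    """Harmonic bond term: k_b * (b - b0)**2."""
    return k_b * (b - b0) ** 2


def lennard_jones(eps_ij, sigma_ij, r_ij):
    """van der Waals term: 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma_ij / r_ij) ** 6
    return 4.0 * eps_ij * (sr6 * sr6 - sr6)


def coulomb(q_i, q_j, r_ij, eps_d=1.0):
    """Electrostatic term: q_i * q_j / (eps_D * r_ij)."""
    return q_i * q_j / (eps_d * r_ij)


def nonbonded_energy(pairs):
    """Sum LJ and Coulomb terms over (eps, sigma, q_i, q_j, r) tuples."""
    return sum(lennard_jones(eps, sigma, r) + coulomb(qi, qj, r)
               for eps, sigma, qi, qj, r in pairs)


# Illustrative values only: the LJ term is zero at r = sigma and reaches
# its minimum of -eps at r = 2**(1/6) * sigma.
e_at_sigma = lennard_jones(1.0, 1.0, 1.0)
e_at_minimum = lennard_jones(1.0, 1.0, 2 ** (1 / 6))
```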
of a bulk system with no surfaces present, unless the behaviour near the
walls (surfaces) is of interest. The bulk is assumed to be composed of
the primary / unit cell surrounded by its exact replicas and thus forms an
infinite lattice in 3D space. The image cells not only have the same size
and shape as the primary one but also contain particles that are images of
the particles in the primary cell. Moreover, the cells are space-filling
and are separated by open boundaries, so particles can freely enter or
leave any cell; whenever a particle leaves a cell, its image enters
through the opposite face, so the total number of particles remains
conserved.
The number of image cells needed depends on the range of intermolecular
forces. When the forces are sufficiently short ranged (e.g. in truncated
Lennard-Jones model), only image cells that adjoin the primary cell are
needed (minimum image convention). For squares in two dimensions,
there are eight adjacent image cells whereas for cubes in three dimensions,
the number of adjacent images is twenty-six. In principle, any cell shape
can be used, provided it fills all of space by translation operations of the
central box in three dimensions. Five shapes satisfy this condition: the
cube (and its close relation, the parallelepiped), the hexagonal prism,
the truncated octahedron, the rhombic dodecahedron and the ‘elongated’
dodecahedron. The truncated octahedron and the rhombic dodecahedron
provide periodic cells that are approximately spherical, thus reducing
the number of solvent molecules needed to solvate the solute and saving
CPU time as well. However, it is often sensible to choose a periodic
cell that reflects the underlying geometry of the system; for example,
the hexagonal prism is appropriate for solvated DNA. For some
simulations, it is inappropriate to use standard PBC in all directions.
PBC may also cause difficulties when the simulation is performed on
inhomogeneous systems or systems that are not at equilibrium. For
example, when studying the adsorption of molecules onto a surface, it is
clearly inappropriate to use the usual PBC for motion perpendicular to
the surface, although it can still be applied to motion parallel to the
surface. PBCs are also not always used for systems such as liquid
droplets or van der Waals clusters, as these systems inherently contain
a boundary. Moreover, to avoid interference from non-bonded
interactions, according to the minimum image convention, the periodic
images must be separated by a distance equal to or larger than a cut-off
distance. In practice, a solute-box distance equal to the
the formula is applied again to make another change in the geometry, until
the minimum is reached. In practice, it is rarely possible to identify the
exact location of minima and/or saddle points. Convergence of a
minimisation is usually judged by tracking
(a) energy difference between successive steps, or
(b) root mean square deviation (RMSD) between successive
configurations, or
(c) calculating root mean square gradients of the energy with respect
to the coordinates.
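The three convergence measures listed above can be sketched as follows. The tolerance values and function names are assumptions chosen for illustration; simulation packages define their own defaults and may normalise these quantities differently.

```python
import math


def rms_gradient(gradient):
    """Root mean square of the gradient components."""
    return math.sqrt(sum(g * g for g in gradient) / len(gradient))


def rmsd(coords_old, coords_new):
    """RMS deviation between two configurations given as (x, y, z) tuples."""
    sq = sum(math.dist(p, q) ** 2 for p, q in zip(coords_old, coords_new))
    return math.sqrt(sq / len(coords_old))


def convergence_report(e_old, e_new, coords_old, coords_new, gradient,
                       e_tol=1e-4, rmsd_tol=1e-3, grad_tol=1e-2):
    """Report which of the three criteria (a), (b), (c) are satisfied."""
    return {
        "energy": abs(e_new - e_old) < e_tol,
        "rmsd": rmsd(coords_old, coords_new) < rmsd_tol,
        "gradient": rms_gradient(gradient) < grad_tol,
    }


# Hypothetical values from two successive minimisation steps.
report = convergence_report(
    e_old=-100.00000, e_new=-100.00001,
    coords_old=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    coords_new=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0001)],
    gradient=[0.5, -0.2, 0.1],
)
```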
Derivative minimization algorithms: In these methods the derivatives
of the energy with respect to the variables, i.e., the cartesian or
internal coordinates, are calculated. This is useful because the
derivatives provide information about the shape of the energy surface
and, if used properly, can significantly enhance the efficiency with
which the minima are located. The direction of the first derivative of
the energy, the gradient, indicates where the minimum lies, and the
magnitude of the gradient indicates the steepness of the local slope.
The energy of the system can be lowered by moving each atom in response
to the force acting on it. The second derivatives indicate the curvature
of the function, information that can be used to predict where the
function will change direction, i.e., pass through a minimum or some
other stationary point.
First-order derivative methods use the first derivatives, i.e., the
gradients, whereas second-order methods use both first and second
derivatives. The Newton-Raphson method is the simplest second-order
method.
First-order minimization methods: The two most frequently used
first-order minimisation algorithms are the steepest descent and
conjugate gradient methods.
Steepest Descent (SD): In the steepest descent method the geometry is
first minimized in the direction parallel to the net force, i.e.,
opposite to the gradient, which is the direction in which the energy
falls most steeply at the initial point; hence the name. Once
a minimum along this first direction is reached, a second minimization is
carried out starting from that point and moving in the steepest remaining
direction. This process continues until a minimum has been reached in
all directions within a sufficient tolerance. The direction of the gradient
is determined by the largest inter-atomic forces, so SD is a good method
for relieving the highest-energy features in an initial configuration. The
method is generally robust even when the starting point is far from a
minimum, where the harmonic approximation to the energy surface is
often a poor assumption. However, it is known for slow progress near
minima. In the SD method with a line search, both the gradients and the
directions of successive steps are orthogonal, so the path shows
oscillatory behaviour.
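The oscillatory behaviour of SD can be seen on a small example. The sketch below assumes a hypothetical two-dimensional quadratic energy E(x) = 0.5 * (x1^2 + 10 * x2^2), a narrow valley, and uses the exact line-search step that is available for a quadratic; real force fields need a numerical line search instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_hessian(v):
    """Multiply by A = diag(1, 10); for this quadratic the gradient at x is A x."""
    return [1.0 * v[0], 10.0 * v[1]]

def energy(x):
    return 0.5 * dot(x, apply_hessian(x))

def steepest_descent(x, n_steps):
    """Steepest descent with an exact line search along the negative gradient."""
    points, gradients = [list(x)], []
    for _ in range(n_steps):
        g = apply_hessian(x)
        if dot(g, g) == 0.0:          # already at the minimum
            break
        alpha = dot(g, g) / dot(g, apply_hessian(g))   # exact step for a quadratic
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        points.append(x)
        gradients.append(g)
    return points, gradients

points, gradients = steepest_descent([10.0, 1.0], 6)
```

Successive entries of `gradients` are mutually orthogonal and the second coordinate of `points` changes sign at every step, which is the zigzag path described above. After six steps the energy is still well above zero, illustrating the slow progress near a minimum.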
Conjugate gradient method (CG): Nonlinear conjugate gradient
methods form another popular type of energy minimization scheme for
large-scale problems where memory and computational performance are
important. In the CG method, the first portion of the search takes place
in the direction parallel to the net force, just as in SD. However, to
avoid the oscillatory behavior of SD as it moves toward a narrow minimum,
the CG method uses information from the previous direction in the next
search. This allows the method to reach the minimum in fewer
steps. In CG, the gradients at successive points are orthogonal but the
directions are conjugate, hence the name. A line search is usually
used, although an arbitrary step method is also possible. The CG method is
usually regarded as the best choice for general use, even though the time
per iteration is longer than for steepest descent for complex systems.
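A minimal sketch of the CG idea follows, again assuming the hypothetical quadratic energy E(x) = 0.5 * (x1^2 + 10 * x2^2). The Fletcher-Reeves formula is used here to mix the previous direction into the new one; it is one common choice among several and is an assumption of this example, not a statement about any particular package.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_a(v):
    """Multiply by the Hessian A = diag(1, 10) of E(x) = 0.5 * x^T A x."""
    return [1.0 * v[0], 10.0 * v[1]]

def conjugate_gradient(x, n_steps):
    g = apply_a(x)               # gradient of the quadratic at x
    d = [-gi for gi in g]        # first direction: along the net force, as in SD
    directions = []
    for _ in range(n_steps):
        if dot(g, g) == 0.0:     # already at the minimum
            break
        alpha = -dot(g, d) / dot(d, apply_a(d))    # exact line search along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = apply_a(x)
        beta = dot(g_new, g_new) / dot(g, g)       # Fletcher-Reeves coefficient
        directions.append(d)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x, directions

x_final, directions = conjugate_gradient([10.0, 1.0], 2)
```

For this two-dimensional quadratic the minimum at the origin is reached in two steps, and the two search directions satisfy d0 . A d1 = 0, i.e. they are conjugate with respect to the Hessian.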
Constrained dynamics
The conformational behavior of a flexible molecule is usually a complex
superposition of different motions. The high frequency motions (e.g. bond
vibrations) are usually of less interest than the lower frequency modes,
which often correspond to major conformational changes. Unfortunately,
the time step of a molecular dynamics simulation is dictated by the
highest frequency motion present in the system. It would therefore
be of considerable benefit to be able to increase the time step without
prejudicing the accuracy of the simulation. Constraint dynamics enables
individual internal coordinates or combinations of specified coordinates
The first two steps mentioned above are often performed essentially
together, because the most common methods of identifying
templates rely on the production of sequence alignments. A good-quality
model depends above all on an accurate sequence alignment.
The step of model construction can be done in several different
ways. Given a template and a sequence alignment between the target
and the template, the information contained therein is used to generate a
three-dimensional structural model of the target, represented as a set of
Cartesian coordinates for each atom in the protein. Three major classes
of model generation methods have been proposed:
(a) Fragment assembly: This is the original method of homology
modelling. This method relied on the assembly of a complete
model from conserved structural fragments identified in closely
related solved structures. For example, a modelling study of
serine proteases in mammals identified a sharp distinction between
“core” structural regions conserved in all experimental structures
in the class, and variable regions typically located in the loops
where the majority of the sequence differences were localized.
Thus, the target proteins could be modeled by first constructing
the conserved core and then substituting variable regions from
other proteins in the set of solved structures. Current
implementations of this method differ mainly in the way they
deal with regions that are not conserved or that lack a template.
The variable regions, i.e., the loop regions, are often constructed
with the help of fragment libraries.
(b) Segment matching: This method divides the entire target protein
into short segments of amino acid residues and each segment is
matched with different structural templates. The selection of the
template is done with sequence similarity, comparison of alpha
carbon coordinates and steric clashes between the side chain
atoms.
(c) Satisfaction of spatial restraints: The most commonly used
technique of homology modelling is by the satisfaction of
spatial restraints. This method of homology modelling takes
When the sequence identity is in the range of 30–50%, errors can be more
severe and are often located in loops. Below 30% sequence identity,
serious errors occur which sometimes result in the basic fold being mis-
predicted. This low-identity region is often referred to as the “twilight
zone” within which homology modelling is extremely difficult. In such
cases homology modelling methods are possibly less suited than fold
recognition.
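The identity ranges quoted above can be turned into a small check on a target-template alignment. The sketch below is illustrative only: the function names and labels are hypothetical, and percent identity is computed over columns where neither sequence has a gap, which is one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def modelling_regime(identity):
    """Map percent identity to the regimes described in the text."""
    if identity < 30.0:
        return "twilight_zone"        # basic fold may be mis-predicted
    if identity <= 50.0:
        return "loop_errors_likely"   # more severe errors, often in loops
    return "above_50"                 # range not covered in this passage

identity = percent_identity("ACDE-F", "ACDQGF")
regime = modelling_regime(identity)
```

In this toy alignment four of the five ungapped columns match, giving 80% identity.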
The other less popular molecular modelling techniques are Threading
and Ab Initio methods.
(b) Threading
X-ray crystallography
http://point.bioinformatics.tw/
http://www.compbio.dundee.ac.uk/www-pips
http://www.jcvi.org/mpidb/about.php
http://www.molecularconnections.com/home/en/home/products/NetPro
http://www.proteinlounge.com/inter_home.asp
http://itolab.cb.k.u-tokyo.ac.jp/Y2H/
http://mips.gsf.de/genre/proj/mpact/index.html
(a) L-alanyl-L-phenylalanyl-glycine
(b) glycyl-L-phenylalanyl-L-alanine
(c) L-phenylalanyl-L-alanyl-glycine
(d) L-alanyl-glycyl-L-phenylalanine
13. Identify the single-letter code that represents the structure
below.
ANSWERS
1. Molecular Modelling is the science that deals with the description
of the atomic and molecular interactions. These interactions govern
the microscopic and macroscopic behaviors of physical systems.
2. The utility of molecular modelling is to make connections between
the microscopic and the macroscopic world, as provided by the theory of
statistical mechanics.
3. The basic aim of the force fields is to define the empirical potential
energy functions V(r) to model the molecular interactions. These
potential energy functions need to be differentiable for the computation
of the forces acting on each atom: F = −∇V(r).
4. At first the theoretical analytical functional forms of the interactions are
derived. For that purpose, the entire system is divided into a number
of atom types which differ by their atomic number and chemical
environment. Thus, the carbons in C=O or C-C are not of the same type
as they are present in different chemical environments. The bonding
parameters, like the bond enthalpy and bond free energy values, are
then determined so as to reproduce the interactions between the various
atom types by fitting procedures. The experimental enthalpy values
of the covalent bonds are used in CHARMM force field. On the other
hand, GROMOS and AMBER force fields are based on experimental
free energies.
5. The topological properties are the descriptions of the covalent
connectivity of the molecules to be modeled.
7. These are the different types of forces acting on the molecule based on
the force-field parameters.
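Answer 3 above states that the force on an atom is the negative gradient of the potential energy. As a hedged illustration, the sketch below uses a harmonic bond term V(r) = 0.5 * k * (r - r0)^2, a form commonly used for bond stretching, and compares the analytic force with a numerical derivative. The force constant and equilibrium length are made-up values in arbitrary units.

```python
K_BOND = 300.0   # hypothetical force constant (arbitrary units)
R_EQ = 1.0       # hypothetical equilibrium bond length (arbitrary units)

def bond_energy(r):
    """Harmonic bond-stretching potential V(r)."""
    return 0.5 * K_BOND * (r - R_EQ) ** 2

def analytic_force(r):
    """F = -dV/dr for the harmonic potential."""
    return -K_BOND * (r - R_EQ)

def numerical_force(r, h=1e-5):
    """Central-difference estimate of -dV/dr."""
    return -(bond_energy(r + h) - bond_energy(r - h)) / (2.0 * h)

stretched = 1.2
f_exact = analytic_force(stretched)
f_numeric = numerical_force(stretched)
```

The negative sign of the force for a stretched bond shows that it pulls the atoms back toward the equilibrium length. This is also why the potential must be differentiable.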
5 Phylogenetic Analysis
5.1 INTRODUCTION
the direction of flow of genetic data among the species. However, the
positioning of the root node in a Phylogenetic tree is based on the
following two assumptions:
1. Addition of the root node as an out-group: In this approach a
specific piece of molecular information (a DNA or rRNA sequence, or
the amino acid sequence of a protein) is chosen such that it
belongs to the same biological family as the sequences of interest
used for the Phylogenetic analysis but is more distantly related
to them than they are to one another. This sequence is then
considered as the out-group. The position of the root then depends
simply on the sequence similarity measures between the sequences
of interest and the out-group sequence.
2. Addition of the root at the midpoint of the Phylogenetic tree: In
this approach, it is assumed that all the sequences are evolving
at the same rate. The root node in such cases is placed at the
midpoint of the path joining the two most distant sequences, i.e.,
the two longest branches.
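Midpoint rooting can be sketched on a small unrooted tree stored as a dictionary of branch lengths. The tree, its labels and its branch lengths below are invented for illustration, and the exhaustive search over leaf pairs is chosen for clarity rather than speed.

```python
def build_tree(edges):
    """Build a symmetric adjacency map {node: {neighbour: branch_length}}."""
    tree = {}
    for a, b, length in edges:
        tree.setdefault(a, {})[b] = length
        tree.setdefault(b, {})[a] = length
    return tree

def paths_from(tree, start):
    """Return {node: (distance, path)} for every node reachable from start."""
    result = {start: (0.0, [start])}
    stack = [start]
    while stack:
        node = stack.pop()
        dist, path = result[node]
        for neighbour, length in tree[node].items():
            if neighbour not in result:
                result[neighbour] = (dist + length, path + [neighbour])
                stack.append(neighbour)
    return result

def midpoint_root(tree):
    """Return (a, b, offset): the root lies on branch a-b, offset away from a."""
    leaves = [n for n, nbrs in tree.items() if len(nbrs) == 1]
    best = None
    for leaf in leaves:
        reach = paths_from(tree, leaf)
        for other in leaves:
            dist, path = reach[other]
            if best is None or dist > best[0]:
                best = (dist, path)     # longest leaf-to-leaf path so far
    total, path = best
    half, walked = total / 2.0, 0.0
    for a, b in zip(path, path[1:]):
        length = tree[a][b]
        if walked + length >= half:
            return a, b, half - walked
        walked += length

example = build_tree([("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 1.0),
                      ("C", "Y", 2.0), ("D", "Y", 6.0)])
root_edge = midpoint_root(example)
```

Here the two most distant sequences are B and D, 9 units apart, so the root is placed on the branch between Y and D at 1.5 units from Y, which is 4.5 units from both B and D.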
From the above figure it is clear that the Phylogenetic analysis has
been performed 100 times with A, B, C and D. Out of these 100 replicates,
B appeared at the same position as presented in the tree 80% of the
time, while the corresponding values for A and C are 60% and 100%
respectively. Thus, it can be concluded that C and D are distant
neighbours and C is related to B.
All said and done, all methods of Phylogenetic analysis have their
own shortcomings. The major ones are:
(a) Limitations in sequence alignment methodologies
(b) Inability to account for differences in the rate of evolution
between different parts of the same sequence
(c) Inability to account for differences in the rate of evolution
of the same sequence in different species.
Different labs are constantly trying to sort out these problems. A few
recent papers describing such endeavours are listed below:
1. RAxML version 8: a tool for Phylogenetic analysis and
post-analysis of large phylogenies, Alexandros Stamatakis,
Bioinformatics, Volume 30, Issue 9, 1 May 2014, Pages 1312–1313,
https://doi.org/10.1093/bioinformatics/btu033
2. Verdant: automated annotation, alignment and Phylogenetic
analysis of whole chloroplast genomes, Michael R. McKain,
Ryan H. Hartsock, Molly M. Wohl, Elizabeth A. Kellogg,
Bioinformatics, Volume 33, Issue 1, 1 January 2017, Pages 130–132,
https://doi.org/10.1093/bioinformatics/btw583
3. ClockstaR: choosing the number of relaxed-clock models in
molecular Phylogenetic analysis, Sebastián Duchêne, Martyna
Molak, Simon Y. W. Ho, Bioinformatics, Volume 30, Issue 7,
1 April 2014, Pages 1017–1019,
https://doi.org/10.1093/bioinformatics/btt665
4. MulRF: a software package for Phylogenetic analysis using
multi-copy gene trees, Ruchi Chaudhary, David Fernández-Baca,
John Gordon Burleigh, Bioinformatics, Volume 31, Issue 3,
1 February 2015, Pages 432–433,
https://doi.org/10.1093/bioinformatics/btu648
5. Unrealistic Phylogenetic trees may improve Phylogenetic
footprinting, Martin Nettling, Hendrik Treutler, Jesus Cerquides,
Ivo Grosse, Bioinformatics, Volume 33, Issue 11, 1 June 2017,
Pages 1639–1646, https://doi.org/10.1093/bioinformatics/btx033
6. markophylo: Markov chain analysis on Phylogenetic trees,
Utkarsh J. Dang, G. Brian Golding, Bioinformatics, Volume 32,
Issue 1, 1 January 2016, Pages 130–132,
https://doi.org/10.1093/bioinformatics/btv541
From the above figure it can be concluded that human is closely related
to mouse as both of them have a common ancestor expressed by the black
circle. Similarly frogs share a recent ancestor with fish represented by
another circle.
FINAL WORDS
Phylogenetic analyses involve the use of biological sequences.
The most commonly used sequences are amino acid sequences and DNA
sequences. However, the choice of the biological sequence depends
on the problem. For the evolutionary study of primates, mitochondrial
DNA is preferred. To study the relatedness between divergent species,
ribosomal RNA sequences are generally used. The success of a
Phylogenetic analysis depends on the accuracy of the underlying sequence
alignment, so a robust alignment method is essential. It is also to be
noted that there is no guarantee that a Phylogenetic tree represents the
true evolutionary pathway. To address this problem, different methods of
tree building may be employed. If the same result is obtained by all the
methods, the Phylogenetic analysis may be considered to represent a
true evolutionary picture. In addition, bootstrapping is commonly used:
the available data are randomly resampled and the tree is rebuilt from
each resampled data set. In an ideal case, the Phylogenetic
trees obtained from bootstrapping analyses should match the original tree.
However, in practice a bootstrap value of 70% for any given branch in a
Phylogenetic tree is considered to correspond to nearly 95% confidence
in the correctness of that branch.
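The resampling step of bootstrapping can be sketched as follows. Columns of the alignment are drawn at random with replacement to build each replicate, and branch support is the percentage of replicate trees that contain a given grouping. The tree-building step itself is omitted, and the function names, the toy alignment and the clade sets are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps sequence names to aligned sequences of equal length.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

def support(replicate_clades, clade):
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)

toy_alignment = {"A": "ACGT", "B": "ACGA"}
replicate = bootstrap_alignment(toy_alignment, random.Random(0))

# Clades recovered in four hypothetical replicate trees.
replicate_clades = [{frozenset("AB")}, {frozenset("AB")},
                    {frozenset("AC")}, {frozenset("AB")}]
ab_support = support(replicate_clades, frozenset("AB"))
```

Each replicate keeps the original alignment length, and the same column indices are applied to every sequence so that the columns stay intact. In the toy example the grouping of A with B is found in three of four replicates, a support of 75%.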
ANSWERS
(a) GenBank.
(b) DDBJ.
(c) EMBL.
(d) TREMBL.
(a) SWISSPROT.
(b) EMBL.
(c) DDBJ.
(d) NCBI.
(a) DDBJ.
(b) PROSITE.
(c) NRDB.
(d) OWL.
(a) PROSITE.
(b) DDBJ.
(c) NRDB.
(d) EMBL.
(a) PDB.
(b) PubChem.
(c) ChemBank.
(d) SCOP.
(a) PubChem.
(b) PDB.
(c) ChemBank.
(d) SCOP.
(a) /.
(b) *.
(c) >.
(d) #.
(a) SwissProt.
(b) GenBank.
(c) UniSTS.
(d) NRDB.
Few More Important Questions and Answers QA.3
(a) PubMed.
(b) Entrez.
(c) PIR.
(d) EBI.
(a) PubMed.
(b) Entrez.
(c) Mozilla.
(d) EBI.
12. PubMed is maintained by
(a) NCBI.
(b) EMBL.
(c) DDBJ.
(d) SWISSPROT.
15. Entrez is maintained by
(a) SWISSPROT.
(b) EMBL.
(c) DDBJ.
(d) NCBI.
16. Find a similarity search tool.
(a) BLAST.
(b) CLUSTALW.
(c) CLUSTALX.
(d) RASMOL.
(d) PIR.
19. Which tool compares protein sequence against protein databases?
(a) blastp.
(b) blastn.
(c) blastx.
(d) tblastx.
20. Which tool compares nucleotide sequence against DNA databases?
(a) blastn.
(b) blastp.
(c) tblastx.
(d) tblastn.
21. Which tool compares translated nucleotide query sequence against
protein databases?
(a) blastp.
(b) tblastn.
(c) blastx.
(d) tblastx.
(d) tblastn.
23. Which tool compares translated nucleotide query sequence against
translated nucleotide databases?
(a) blastp.
(b) blastn.
(c) tblastx.
(d) tblastn.
24. Which E-value indicates the more significant hit?
(a) lower. (b) higher.
28. EST is
(a) Expressed Sequence Tag.
(b) Expressed Site Tag.
(c) Expressed Structure Tag.
(d) Expressed Symbol Tag.
29. SNP is
(a) Small Nucleic Polymorphism.
(b) Single Nucleic Polymorphism.
(c) Single Nucleotide Polymorphism.
(d) Small Nucleotide Polymorphism.
30. What is the E-value?
(a) The chance that a random sequence could achieve a better score
than the query.
(b) The chance that a homologous sequence could achieve a similar
score to the query.
(c) The chance that a random sequence could achieve a worse score
than the query.
(d) The chance that a homologous sequence could achieve a better
score than the query.
31. PROSITE is
(a) A database of protein structures.
(b) A database of protein sequences.
(c) A database of protein motifs.
(d) Options a and b.
37. Fingerprint is
(a) A protein family discriminator built from a set of regular
expressions.
(b) A protein family discriminator built from a set of conserved motifs.
(c) A cluster of protein sequences gathered from a BLAST search.
(d) A cluster of protein sequences gathered from a FASTA search.
(d) fasta.
50. Comparison of two sequences is called
(a) Global alignment
(b) Local alignment
(c) Pairwise sequence alignment
(d) Multiple sequence alignment
51. Comparison of more than two sequences is called
(a) Global alignment
(b) Local alignment
(c) Pairwise sequence alignment
(d) Multiple sequence alignment
52. Highly similar sequences are aligned by
(a) Global alignment
(b) Local alignment
(c) Pairwise sequence alignment
(d) Multiple sequence alignment
53. An optimum alignment _____ number of matches and ______ the
number of gaps.
(a) SMART.
(b) PRINTS.
(a) PfamA.
(b) PfamB.
(a) PfamD.
(b) PfamB.
(a) BLASTN.
(b) BLASTP.
(c) SNPBLAST.
(d) PSIBLAST.
(a) PHIBLAST.
(b) PSIBLAST.
(c) BLASTN.
(d) TBLASTX.
62. The PRINTS database is linked to
(a) SwissProt/TrEMBL.
(b) SwissProt/EMBL.
(c) PIR/TrEMBL.
(d) PIR/EMBL.
63. In a dot matrix, the sequences are represented
(a) as the coordinates of a two-dimensional graph.
(b) in the form of trees.
(c) as the coordinates of a 3D graph.
(d) not as a graph.
64. PAM is
(a) Parallel Align Mutation.
(b) Point Altered Mutation.
(c) Point Accepted Mutation.
(d) Point Arranged Mutation.
65. Local alignment is done by
(a) Needleman and Wunsch.
(b) PAM.
(c) Smith-Waterman.
(d) All the above.
66. Global alignment is done by
(a) Needleman and Wunsch.
(b) Smith-Waterman.
ANSWERS