
QUESTION BANK (BIOINFORMATICS I)

Prepared by Dr Chandrabhan Seniya


Associate Professor
Department of Biotechnology
Meerut Institute of Engineering and Technology
Email: [email protected], [email protected]

Unit 1

Introduction to Bioinformatics; Biological databases: Nucleotide databases, Protein databases,


Specialized databases; Laboratory data submission and data retrieval; Various file formats for
biomolecular sequences: GenBank, EMBL, FASTA, GCG, MSF, NBRF-PIR, etc.; Basic concepts
of sequence similarity: identity and homology, definitions of homologues, orthologues,
paralogues; Sequence patterns and profiles.

1. Describe applications of bioinformatics in detail.


Answer: Bioinformatics is mainly used to extract knowledge from biological data through the
development of algorithms and software. Bioinformatics is widely applied in the examination of
Genomics, Proteomics, 3D structure modelling of Proteins, Image analysis, Drug designing and a
lot more. A significant application of bioinformatics can be found in the fields of precision and
preventive medicines, which are mainly focused on developing measures to prevent, control and
cure dreadful infectious diseases. The main aim of Bioinformatics is to increase the understanding
of biological processes. Listed below are a few applications of Bioinformatics.
1. In Gene therapy.
2. In Evolutionary studies.
3. In Microbial applications.
4. In Prediction of Protein Structure.
5. For the Storage and Retrieval of Data.
6. In the field of medicine, used in the discovery of new drugs.
7. In biometric analysis for identification and access control, and in agriculture for
improving crop management, crop production and pest control.

2. Explain scope of bioinformatics along with its branches in detail.


Answer: Bioinformatics is an emerging branch of biological science that arose from the
combination of biology and information technology. It is a multidisciplinary subject where
information technology is incorporated by means of various computational and analytical tools
for the interpretation of biological data. It is a combination of various areas
including Biology, Chemistry, Mathematics, Statistics, and Computer Science. Bioinformatics is
about developing new technologies and software tools in the fields of medicine, biological
research, and biotechnology. The major scope and application of bioinformatics are:
 To understand the function of genes
 Cell organizations and function
 Analysis of drug targets
 Examine the characteristics of various diseases
 Integration and development of various tools for the management of biological databases.

Bioinformatics is a subject that is a combination of biology and technology. It requires


complete knowledge of engineering as well as life sciences. This sector draws on a well of
biological data and uses this information to create new tools and software that support
advances in biological research. Bioinformatics is subdivided into two sections,
namely,
 Animal bioinformatics
 Plant bioinformatics

Applications of Bioinformatics

Applications of bioinformatics depend on extracting useful facts and figures from large
collections of stored data and processing them into usable information. Bioinformatics
focuses its scope on the areas of 3D image processing, 3D modelling of living cells, image
analysis, drug development, and a lot more. The most important application of bioinformatics
can be seen in the field of medicine, where its data is used to develop treatments for
infectious and harmful diseases.
The overall aim is to make the study of complex natural processes simpler and more
tractable. Some examples of the application of bioinformatics are as follows.
 Bioinformatics is largely used in gene therapy.
 This branch finds application in evolutionary studies.
 Microbial analysis and computing.
 Understanding protein structure and modeling.
 Storage and retrieval of biotechnological data.
 In the finding of new drugs.
 In agriculture to understand crop patterns, pest control, and crop management.
[Figure: Applications of Bioinformatics]

Bioinformatics Subfields & Related Disciplines


The area of bioinformatics incorporates a wide range of biotechnological sub-disciplines
that combine grounding in the biological sciences with deep knowledge of
computer science and information technology. Bioinformatics will continue to grow in scope and utility.
Some of the examples of many fields of bioinformatics include:
 Computational biology: The use of computational, data-driven solutions to problems
in biology.
 Genetics: The study of heredity and the variation of inherited characteristics.
 Genomics: The branch of molecular biology concerned with the structure,
function, evolution, and mapping of genomes.
 Proteomics: The study of proteomes and their features.
 Metagenomics: The study of genetic material recovered directly from environmental
and living-host samples.
 Transcriptomics: The study of the transcriptome, the complete set of RNA transcripts
produced by the genome.
 Phylogenetics: The study of evolutionary relationships among groups of organisms.
 Metabolomics: The study of the biochemistry of metabolism and metabolites in living
beings.
 Systems biology: Mathematical modelling, analysis and visualization of large sets
of biological data.
 Structural analysis: Modeling that determines the effects of physical loads on physical
structures.
 Molecular modeling: The design and description of molecular structures by means of
computational chemistry.
 Pathway analysis: Software-based analysis that identifies related proteins acting
together in the metabolic pathways of the body.

Scope of Bioinformatics
The main scope of Bioinformatics is to fetch all the relevant data and process it into useful
information. It also deals with –
 Management and analysis of a wide set of biological data.
 It is especially used in human genome sequencing, where large sets of data are being
handled.
 Bioinformatics plays a major role in the research and development of the biomedical
field.
 Bioinformatics uses computational coding for several applications that involve finding
gene and protein functions and sequences, developing evolutionary relationships, and
analyzing the three-dimensional shapes of proteins.
 Research works based on genetic disease and microbial disease entirely depend on
bioinformatics, where the derived information can be vital to produce personalized
medicines.

[Figure: Scope of Bioinformatics]

3. List names of two nucleotide databases.


Answer: The Nucleotide database is a collection of sequences from several sources, including
GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the
foundation for biomedical research and discovery.

DDBJ (Japan), GenBank (USA) and European Nucleotide Archive (Europe) are repositories
for nucleotide sequence data from all organisms. All three accept nucleotide sequence
submissions, and then exchange new and updated data on a daily basis to achieve optimal
synchronization between them.
4. Write two names of protein databases.
Answer: A protein database is a collection of data that has been constructed from physical,
chemical and biological information on sequence, domain structure, function, three-
dimensional structure and protein-protein interactions. Two widely used examples are the
Protein Information Resource (PIR) and SWISS-PROT, both described below.

The PRIMARY databases hold experimentally determined protein sequences together with
those inferred from the conceptual translation of nucleotide sequences. The latter, of
course, is not experimentally derived information, but arises from interpretation of the
nucleotide sequence information and consequently must be treated as potentially containing
misinterpreted information. There are a number of primary protein sequence databases, and
each requires some specific consideration.
a. Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD):
 The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich
Information Centre for Protein Sequences, Germany) and the JIPID (Japan International
Protein Information Database, Japan).
 The PIR-PSD is now a comprehensive, non-redundant, expertly annotated, object-
relational DBMS.
 A unique characteristic of the PIR-PSD is its classification of protein sequences based
on the superfamily concept.
 The sequence in PIR-PSD is also classified based on homology domain and sequence
motifs.
 Homology domains may correspond to evolutionary building blocks, while sequence
motifs represent functional sites or conserved regions.
 The classification approach allows a more complete understanding of sequence
function-structure relationship.

b. SWISS-PROT
 The other well known and extensively used protein database is SWISS-PROT. Like
the PIR-PSD, this curated protein sequence database also provides a high level of
annotation.
 The data in each entry can be considered separately as core data and annotation.
 The core data consists of the sequences entered in common single letter amino acid
code, and the related references and bibliography. The taxonomy of the organism from
which the sequence was obtained also forms part of this core information.
 The annotation contains information on the function or functions of the protein; post-
translational modifications such as phosphorylation, acetylation, etc.; functional and
structural domains and sites, such as calcium-binding regions, ATP-binding sites, zinc
fingers, etc.; known secondary structural features such as alpha helices, beta
sheets, etc.; the quaternary structure of the protein; similarities to other proteins, if any;
and diseases associated with the protein. Discrepancies arising from different authors
publishing different sequences for the same protein, or from mutations in different strains
of an organism, are also described as part of the annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is
released as a supplement to SWISS-PROT. It contains the translation of all coding sequences
present in the EMBL Nucleotide database, which have not been fully annotated. Thus, it may
contain the sequence of proteins that are never expressed and never actually identified in the
organisms.

c. Protein Data Bank (PDB):


 PDB is a primary protein structure database. It is a crystallographic database for the
three-dimensional structures of large biological molecules, such as proteins.
 In spite of the name, PDB archives the three-dimensional structures of not only
proteins but all biologically important molecules, such as nucleic acid fragments,
RNA molecules, large peptides such as the antibiotic gramicidin, and complexes of protein
and nucleic acids.
 The database holds data derived mainly from three sources: structures determined by
X-ray crystallography, NMR experiments, and molecular modelling.

The secondary databases are so termed because they contain the results of analysis of the
sequences held in primary databases. Many secondary protein databases are the result of
looking for features that relate different proteins. Some commonly used secondary databases
of sequence and structure are as follows:
a. PROSITE:
 A number of databases collect patterns found in protein sequences rather than the
complete sequences. PROSITE is one such pattern database.
 The protein motifs and patterns are encoded as “regular expressions”.
 The information corresponding to each entry in PROSITE takes two forms – the
patterns and the related descriptive text.

b. PRINTS:
 In the PRINTS database, the protein sequence patterns are stored as ‘fingerprints’. A
fingerprint is a set of motifs or patterns rather than a single one.
 The information contained in a PRINTS entry may be divided into three sections. In
addition to entry name, accession number and number of motifs, the first section
contains cross-links to other databases that have more information about the
characterized family.
 The second section provides a table showing how many of the motifs that make up the
fingerprint occur in how many of the sequences in that family.
 The last section of the entry contains the actual fingerprints, which are stored as multiple
aligned sets of sequences; the alignment is made without gaps. There is, therefore, one
set of aligned sequences for each motif.

c. MHCPep:
 MHCPep is a database comprising over 13000 peptide sequences known to bind the
Major Histocompatibility Complex of the immune system.
 Each entry in the database contains not only the peptide sequence, which may be 8 to
10 amino acids long, but also information on the specific MHC molecules to
which it binds, the experimental method used to assay the peptide, the degree of activity
and the binding affinity observed, the source protein that, when broken down, gave rise
to the peptide, the positions along the peptide where it anchors on the
MHC molecules, and references and cross-links to other information.

d. Pfam
 Pfam contains profiles of protein families built using hidden Markov models (HMMs).
 HMMs model a pattern as a series of match, substitute, insert and delete states,
with scores assigned for moving from one state to another during alignment.
 Each family or pattern defined in Pfam consists of four elements. The first is
the annotation, which has information on the source used to make the entry, the method
used and some numbers that serve as figures of merit.
 The second is the seed alignment that is used to bootstrap the rest of the sequences
into the multiple alignments and then the family.
 The third is the HMM profile.
 The fourth element is the complete alignment of all the sequences identified in that
family.

5. What information is stored in a primary database?


Answer: The PRIMARY databases hold experimentally determined protein sequences together
with those inferred from the conceptual translation of nucleotide sequences. The latter, of
course, is not experimentally derived information, but arises from interpretation of the
nucleotide sequence information and consequently must be treated as potentially containing
misinterpreted information. There are a number of primary protein sequence databases, and
each requires some specific consideration.

6. Explain secondary database.


Answer: The secondary databases are so termed because they contain the results of analysis
of the sequences held in primary databases. Many secondary protein databases are the result of
looking for features that relate different proteins. Some commonly used secondary databases
of sequence and structure are as follows:
a. PROSITE:
 A number of databases collect patterns found in protein sequences rather than the
complete sequences. PROSITE is one such pattern database.
 The protein motifs and patterns are encoded as “regular expressions”.
 The information corresponding to each entry in PROSITE takes two forms – the
patterns and the related descriptive text.

b. PRINTS:
 In the PRINTS database, the protein sequence patterns are stored as ‘fingerprints’. A
fingerprint is a set of motifs or patterns rather than a single one.
 The information contained in a PRINTS entry may be divided into three sections. In
addition to entry name, accession number and number of motifs, the first section
contains cross-links to other databases that have more information about the
characterized family.
 The second section provides a table showing how many of the motifs that make up the
fingerprint occur in how many of the sequences in that family.
 The last section of the entry contains the actual fingerprints, which are stored as multiple
aligned sets of sequences; the alignment is made without gaps. There is, therefore, one
set of aligned sequences for each motif.

c. MHCPep:
 MHCPep is a database comprising over 13000 peptide sequences known to bind the
Major Histocompatibility Complex of the immune system.
 Each entry in the database contains not only the peptide sequence, which may be 8 to
10 amino acids long, but also information on the specific MHC molecules to
which it binds, the experimental method used to assay the peptide, the degree of activity
and the binding affinity observed, the source protein that, when broken down, gave rise
to the peptide, the positions along the peptide where it anchors on the
MHC molecules, and references and cross-links to other information.

d. Pfam
 Pfam contains profiles of protein families built using hidden Markov models (HMMs).
 HMMs model a pattern as a series of match, substitute, insert and delete states,
with scores assigned for moving from one state to another during alignment.
 Each family or pattern defined in Pfam consists of four elements. The first is
the annotation, which has information on the source used to make the entry, the method
used and some numbers that serve as figures of merit.
 The second is the seed alignment that is used to bootstrap the rest of the sequences
into the multiple alignments and then the family.
 The third is the HMM profile.
 The fourth element is the complete alignment of all the sequences identified in that
family.

7. Describe sequence alignment.

8. Write a note on ENTREZ and NCBI.


Answer: ENTREZ is a widely used interface for information retrieval from several linked
databases. It is a text-based search and retrieval system at NCBI for the major databases
including PubMed, Nucleotide and Protein sequences, Protein structures, Complete Genomes,
Taxonomy and others. A search for a gene will give links to different web pages on the genome,
the protein sequence it encodes, its structure and function etc. Some of the features:
1. There are various interrelationships between the databases on ENTREZ, and hence
a user can seamlessly access all the information for a particular gene from different
databases at the same time.
2. A key concept in the system is neighboring; the neighboring link helps the user locate
related references or sequences.
3. There are hard links, which are very specific connections between entries in different
databases.
4. There are hyperlinked articles and a related-articles hyperlink on the webpage too.

It is purely a system for accessing and searching the databases at NCBI; it cannot be set up
on one’s own server. ENTREZ has access to all NCBI databases, such as PubMed, Books, OMIM,
OMIA, Gene, Genome, Nucleotide etc.

NCBI: To maintain biological data repositories and provide biocomputing services, we must
mention the leading American information provider, the National Center for Biotechnology
Information (NCBI). The NCBI was established in 1988 as a division of the National Library
of Medicine (NLM) and is located on the campus of the National Institutes of Health (NIH).
The NLM was chosen to host NCBI because of its experience in biomedical database
maintenance, and because, as part of the NIH, it could establish a research program in
computational biology. The role of the NCBI is to develop new information technologies to aid
our understanding of the molecular and genetic processes that underlie health and disease. Its
specific aims include the creation of automated systems for storing and analyzing biological
information; the development of advanced methods of computer-based information processing;
the facilitation of user access to databases and software; and the co-ordination of efforts to
gather biotechnology information worldwide. Since 1992, one of the principal tasks of the
NCBI has been the maintenance of GenBank, the NIH DNA sequence database. Groups of
annotators create sequence data records from the scientific literature and, together with
information acquired directly from authors, data are exchanged with the international
nucleotide databases, EMBL and DDBJ. At the NCBI the ENTREZ facility (a search tool of
NCBI) was developed to allow retrieval of molecular biology data and bibliographic citations
from NCBI’s integrated databases.
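
For programmatic access, ENTREZ can also be queried from a script through NCBI's E-utilities. Below is a minimal sketch using Biopython's Bio.Entrez module; the email address, search term and result handling are illustrative placeholders, and Biopython plus network access are assumed.

```python
# A minimal sketch of programmatic ENTREZ access via Biopython's Bio.Entrez
# (the email and query term below are placeholders, not from the original text).
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address

# Search the Nucleotide database for a query term
handle = Entrez.esearch(db="nucleotide",
                        term="BRCA1[Gene] AND human[Organism]", retmax=5)
record = Entrez.read(handle)
handle.close()
print(record["IdList"])  # identifiers of the top hits

# Fetch the first hit in FASTA format
if record["IdList"]:
    fetch = Entrez.efetch(db="nucleotide", id=record["IdList"][0],
                          rettype="fasta", retmode="text")
    print(fetch.read()[:200])
    fetch.close()
```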

9. How are biological databases categorised on the type of information they contain?
Give example of each category of database.
Answer: Primary databases: contain original biological data. They are archives of raw
sequence or structural data submitted by the scientific community, e.g., GenBank and the
Protein Data Bank (PDB).

Secondary databases: contain computationally processed or manually curated information
based on original information from primary databases. E.g., Swiss-Prot contains curated
entries derived from primary sequence data, with added annotation such as classification
of entries according to structural features (alpha-helices or ß-sheets) and information on
conserved secondary-structure motifs of a particular protein.

Specialized databases: cater to a particular research interest. Eg: Flybase, HIV sequence
database, and ribosomal database project specialize in a particular organism or a particular type
of data.
10. What information would you obtain from the following databases.
i. UniProt
ii. dbSTS
iii. UniGene
iv. dbHTGS
Answer: UniProt: provides highly curated protein sequence and function information.

dbSTS: database of sequence-tagged sites. STSs are short (200 to 500 base pair) DNA
sequences that occur once in the genome and whose location and base sequence are known.

UniGene: UniGene computationally identifies transcripts from the same locus; analyzes
expression by tissue, age, and health status; and reports related proteins (protEST) and clone
resources.

dbHTGS: contains unfinished DNA sequences generated by high-throughput genome sequencing.


Unit 2

Sequence Alignment And Database Searching: Introduction, Evolutionary Basis of Sequence


Alignment, Optimal alignment method, Statistical Significance of Alignment. Database
searching Artifacts; Database similarity searching: FASTA, BLAST, Various versions of basic
BLAST and FASTA, Advanced versions of BLAST: PHI-BLAST and profile-based database
searches using PSI-BLAST; Multiple sequence alignment: progressive method and Iterative
method; Applications of pairwise and multiple sequence alignment; Tools for multiple
sequence alignment: CLUSTALW and Pileup (Algorithmic concepts).

1. Write a short note on BLAST.


Answer: BLAST (Basic Local Alignment Search Tool) has become the de facto standard in
search and alignment tools. The BLAST program was developed by Stephen Altschul of NCBI
in 1990 and has since become one of the most popular programs for sequence analysis. The
BLAST algorithm is still actively being developed, and the original BLAST paper is one of the
most cited papers ever written in this field of biology. Many researchers use BLAST as an
initial screening of their sequence data from the laboratory and to get an idea of what they
are working on. BLAST is far from basic, as the name indicates; it is a highly advanced
algorithm which has become very popular due to availability, speed, and accuracy. In short,
a BLAST search identifies homologous sequences by searching one or more databases, usually
hosted by NCBI (http://www.ncbi.nlm.nih.gov/), with the query sequence of interest.

BLAST is an open source program and anyone can download and change the program code.
This has also given rise to a number of BLAST derivatives; WU-BLAST is probably the most
commonly used [Altschul and Gish, 1996]. BLAST is highly scalable and comes in a number
of different computer platform configurations which makes usage on both small desktop
computers and large computer clusters possible.
Examples of BLAST usage
BLAST can be used for many different purposes. A few of them are mentioned below.
 Looking for species. If you are sequencing DNA from unknown species, BLAST may
help identify the correct species or homologous species.
 Looking for domains. If you BLAST a protein sequence (or a translated nucleotide
sequence) BLAST will look for known domains in the query sequence.
 Looking at phylogeny. You can use the BLAST web pages to generate a phylogenetic
tree of the BLAST result.
 Mapping DNA to a known chromosome. If you are sequencing a gene from a known
species but have no idea of the chromosome location, BLAST can help you. BLAST
will show you the position of the query sequence in relation to the hit sequences.
 Annotations. BLAST can also be used to map annotations from one organism to another
or look for common genes in two related species.
Searching for homology
Most research projects involving sequencing of either DNA or protein have a requirement for
obtaining biological information of the newly sequenced and maybe unknown sequence. If the
researchers have no prior information of the sequence and biological content, valuable
information can often be obtained using BLAST. The BLAST algorithm will search for
homologous sequences in predefined and annotated databases of the user’s choice. In an easy
and fast way, the researcher can gain knowledge of gene or protein function and find
evolutionary relations between the newly sequenced DNA and well-established data. After the
BLAST search the user will receive a report specifying found homologous sequences and their
local alignments to the query sequence.

Variants of BLAST
 BLAST-N: compares nucleotide sequences with nucleotide sequences
 BLAST-P: compares protein sequences with protein sequences
 BLAST-X: compares the six-frame translations of a nucleotide sequence against
protein sequences
 tBLAST-N: compares a protein sequence against the six-frame translations of
nucleotide sequences
 tBLAST-X: compares the six-frame translations of a nucleotide sequence against the
six-frame translations of nucleotide sequences.
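
As an illustration, a BLAST search can be submitted to the NCBI servers directly from a script. The following is a minimal sketch using Biopython's NCBIWWW.qblast; Biopython, network access, and the toy query sequence are assumptions, not part of the original text.

```python
# A hedged sketch of submitting a BLASTN search to NCBI from Python with
# Biopython (the query sequence is an arbitrary placeholder).
from Bio.Blast import NCBIWWW, NCBIXML

query = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# This performs a network request and may take a while to return.
result_handle = NCBIWWW.qblast("blastn", "nt", query)  # program, database, sequence

blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:3]:
    hsp = alignment.hsps[0]
    print(alignment.title[:60], hsp.expect)  # hit description and E-value
```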

2. Explain FASTA along with its steps.


Answer: FASTA stands for “fast-all” or “FastA”. FASTA is another sequence alignment tool
which is used to search for similarities between sequences of DNA and proteins. The query
sequence is broken down into sequence patterns or words known as k-tuples, and the target
sequences are searched for these k-tuples in order to find the similarities between the two.
FASTA is a fine tool for similarity searches. When searching for sequence similarities, a
common strategy is to first perform a BLAST search and then turn to FASTA. The FASTA
file format is widely used as the input format in other sequence alignment tools like BLAST.
A web interface for FASTA is available at the European Bioinformatics Institute
(EBI).

It was the first database similarity search tool developed, preceding the development of
BLAST. FASTA uses a “hashing” strategy to find matches for a short stretch of identical
residues of length k. Such a string of residues is known as a k-tuple or “ktup”, equivalent
to a word in BLAST but normally shorter. Typically, a ktup is composed of two residues for
protein sequences and six residues for DNA sequences. The query sequence is thus broken
down into sequence patterns or words known as k-tuples, and the target sequences are
searched for these k-tuples in order to find the similarities between the two.

Working of BLAST and FASTA


 FASTA and BLAST are the software tools used in bioinformatics. Both BLAST and
FASTA use a heuristic word method for fast pairwise sequence alignment.
 It works by finding short stretches of identical or nearly identical letters in two
sequences. These short strings of characters are called words.
 The basic assumption is that two related sequences must have at least one word in
common.
 By first identifying word matches, a longer alignment can be obtained by extending
similarity regions from the words.
 Once regions of high sequence similarity are found, adjacent high-scoring regions can
be joined into a full alignment.
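
The word-matching step described above can be illustrated with a short sketch: index every word of length k in the query, then report positions where the target contains the same word. This is toy code for the idea only, not the actual BLAST or FASTA implementations.

```python
# A minimal illustration of the word (k-tuple) seeding idea shared by BLAST and
# FASTA: index all k-length words of one sequence, then scan the other for hits.
from collections import defaultdict

def word_hits(query: str, target: str, k: int = 3):
    index = defaultdict(list)               # word -> positions in query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hits = []                               # (query_pos, target_pos) seed pairs
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits

print(word_hits("GATTCTATCTAACTA", "GTTCTATTCTAAC", k=3))
```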

3. Differentiate between local and global alignment. Which alignment would you
prefer for highly similar sequences?
Answer: Global alignment
 Sequences are assumed to be generally similar over their entire length. Alignment is
carried out to find the best possible alignment over the entire length of the sequences.
 The method works best for closely related sequences of approximately the same length.
 It works less well for divergent sequences or sequences of very different lengths.
Local alignment
 Looks for local regions of similarity and aligns these regions without regard for the rest of
the sequence.
 The method is good for aligning more divergent sequences with the goal of searching for
conserved patterns in DNA or protein sequences.
For highly similar sequences of comparable length, global alignment is the preferred choice.

4. Describe at least 3 types of sequence alignment methods.


Answer:

5. Compare FASTA and BLAST.


Answer:
BLAST and FASTA are two similarity searching programs that identify homologous DNA
sequences and proteins based on excess sequence similarity. Excess similarity between
two DNA or amino acid sequences arises from common ancestry, i.e. homology. The most
effective similarity searching compares the amino acid sequences of proteins rather than
DNA sequences. Both BLAST and FASTA use a scoring strategy in order to compare two
sequences and provide highly accurate statistical estimates about the similarities between
sequences. The main difference between BLAST and FASTA is that BLAST is mostly
involved in finding ungapped, locally optimal sequence alignments whereas FASTA is
involved in finding similarities between less similar sequences.

Difference Between BLAST and FASTA

Definition
BLAST: BLAST is an algorithm for comparing primary biological sequence information like
nucleotide or amino acid sequences.
FASTA: FASTA is a DNA and protein sequence alignment software package.
Stands for
BLAST: BLAST stands for Basic Local Alignment Search Tool.
FASTA: FASTA is short for “fast-all” or “FastA”.
Global/Local Alignment
BLAST: BLAST uses local sequence alignment.
FASTA: FASTA uses local sequence alignment first and then it extends the similarity search
to global alignment.
Local Sequence Alignment
BLAST: BLAST searches similarities in local alignment by comparing individual residues in
the two sequences.
FASTA: FASTA searches similarities in local alignments by comparing sequence patterns or
words.
Type of Search
BLAST: BLAST is better for similarity searching in closely matched or locally optimal
sequences.
FASTA: FASTA is better for similarity searching in less similar sequences.
Type of Work
BLAST: BLAST works best for protein searches.
FASTA: FASTA works best for nucleotide searches.
Gaps in Query Sequence
BLAST: In BLAST, gaps between query and target sequences are not allowed.
FASTA: In FASTA, gaps are allowed.
Sensitivity
BLAST: BLAST is a sensitive bioinformatics tool.
FASTA: FASTA is more sensitive than BLAST.
Speed
BLAST: BLAST is speedier than FASTA.
FASTA: FASTA is a slower tool when compared to BLAST.
Developers
BLAST: BLAST was designed by Stephen Altschul, Webb Miller, Warren Gish, Eugene
Myers and David J. Lipman at the National Institutes of Health in 1990.
FASTA: FASTA was developed by David J. Lipman and William R. Pearson in 1985.
Significance
BLAST: At present, BLAST is the most widely used bioinformatics tool for similarity
searches.
FASTA: The legacy of FASTA is the FASTA format, which is now ubiquitous in
bioinformatics.
6. Differentiate primary and secondary database.
Answer: Database is a collection of related data arranged in a way suitable for adding, locating,
removing and modifying the data
I. Primary database
1. It is also known as archival database
2. Databases consisting of experimentally derived data, such as nucleotide sequences and
three-dimensional structures, are known as primary databases.
3. They contain original experimental results that are directly submitted to the database by
researchers across the globe.
4. Primary databases have high levels of redundancy or duplication of data.
5. Examples: GenBank, DDBJ, PDB

II. Secondary database


1. It is also known as a curated database or derived database.
2. Databases consisting of data derived from the analysis of primary data, such as
nucleotide sequences, protein structures etc.
3. It contains results of the analysis of primary databases and significant data in the form of
conserved sequences, signature sequences, active-site residues of proteins etc.
4. Secondary databases have low levels of redundancy or duplication of data, as they are curated.
5. Examples: Swiss-Prot, PROSITE, Pfam

7. Why Dynamic Programming is better than other methods?


Answer: The technique of dynamic programming can be applied to produce global alignments
via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman
algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to
amino-acid matches or mismatches, and a gap penalty for matching an amino acid in one
sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in
practice often simply assign a positive match score, a negative mismatch score, and a negative
gap penalty. (In standard dynamic programming, the score of each amino acid position is
independent of the identity of its neighbors, and therefore base stacking effects are not taken
into account. However, it is possible to account for such effects by modifying the algorithm.)
Dynamic programming can be useful in aligning nucleotide to protein sequences, a task
complicated by the need to take into account frameshift mutations (usually insertions or
deletions). The framesearch method produces a series of global or local pairwise alignments
between a query nucleotide sequence and a search set of protein sequences, or vice versa.
Although the method is very slow, its ability to evaluate frameshifts offset by an arbitrary
number of nucleotides makes the method useful for sequences containing large numbers of
indels, which can be very difficult to align with more efficient heuristic methods. In practice,
the method requires large amounts of computing power or a system whose architecture is
specialized for dynamic programming. The BLAST and EMBOSS suites provide basic tools
for creating translated alignments (though some of these approaches take advantage of side-
effects of sequence searching capabilities of the tools). More general methods are available
from both commercial sources, such as FrameSearch, distributed as part of the Accelrys GCG
package, and Open Source software such as Genewise.
The dynamic programming method is guaranteed to find an optimal alignment given a
particular scoring function; however, identifying a good scoring function is often an empirical
rather than a theoretical matter. Although dynamic programming is extensible to more than two
sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.
Dynamic programming is better than other methods for several reasons:
1. Dynamic programming is used to obtain the optimal solution.
2. In dynamic programming, we choose at each step, but the choice may depend on the
solutions to sub-problems.
3. It can be less efficient than a greedy approach, but unlike a greedy approach it is
guaranteed to find the optimum.
4. Example: the 0/1 knapsack problem.
5. It is guaranteed that dynamic programming will generate an optimal solution using the
Principle of Optimality.

8. Contrast local and global sequence alignment with suitable example.


Answer: Sequence alignment is the procedure of comparing two (pairwise alignment) or more
multiple sequences by searching for a series of individual characters or patterns that are in the
same order in the sequences.

Global Sequence Alignment
 In global alignment, an attempt is made to align the entire sequence (end-to-end
alignment).
 A global alignment contains all letters from both the query and target sequences.
 If two sequences have approximately the same length and are quite similar, they are
suitable for global alignment.
 Suitable for aligning two closely related sequences.
 Global alignments are usually done for comparing homologous genes, like comparing
two genes with the same function (in human vs. mouse), or comparing two proteins with
similar function.
 A general global alignment technique is the Needleman–Wunsch algorithm.
 Examples of global alignment tools: EMBOSS Needle, Needleman-Wunsch Global
Align Nucleotide Sequences (Specialized BLAST).

Local Sequence Alignment
 Finds local regions with the highest level of similarity between the two sequences.
 A local alignment aligns a substring of the query sequence to a substring of the target
sequence.
 Any two sequences can be locally aligned, as local alignment finds stretches of sequence
with a high level of matches without considering the alignment of the rest of the
sequence regions.
 Suitable for aligning more divergent sequences or distantly related sequences.
 Used for finding conserved patterns in DNA sequences or conserved domains or
motifs in two proteins.
 A general local alignment method is the Smith–Waterman algorithm.
 Examples of local alignment tools: BLAST, EMBOSS Water, LALIGN.

9. For protein alignment, the situation is far more complex. Support this
statement.
Answer:

10. Explain transversion and transition with example.


Answer: DNA substitution mutations are of two types. Transitions are interchanges of two-
ring purines (A ↔ G), or of one-ring pyrimidines (C ↔ T): they therefore involve bases of
similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore
involve exchange of one-ring and two-ring structures.

Although there are twice as many possible transversions, because of the molecular
mechanisms by which they are generated, transition mutations are generated at higher
frequency than transversions. As well, transitions are less likely to result in amino acid
substitutions (due to "wobble") and are therefore more likely to persist as "silent substitutions"
in populations as single nucleotide polymorphisms (SNPs).

[Figure: Transition versus transversion mutations]
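
The definitions above can be captured in a small helper function; this is an illustrative sketch, not part of the original text.

```python
# Classify a DNA substitution as a transition (purine<->purine or
# pyrimidine<->pyrimidine) or a transversion (purine<->pyrimidine).
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def substitution_type(b1: str, b2: str) -> str:
    if b1 == b2:
        return "identity"
    if (b1 in PURINES and b2 in PURINES) or (b1 in PYRIMIDINES and b2 in PYRIMIDINES):
        return "transition"
    return "transversion"

print(substitution_type("A", "G"))  # transition
print(substitution_type("A", "T"))  # transversion
```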

11. Discuss a scoring matrix used for sequence alignment.


Answer: Scoring matrices are used to determine the relative score made by matching two
characters in a sequence alignment. These are usually log-odds of the likelihood of two
characters being derived from a common ancestral character. There are many flavors of scoring
matrices for amino acid sequences, nucleotide sequences, and codon sequences, and each is
derived from the alignment of "known" homologous sequences. These alignments are then
used to determine the likelihood of one character being at the same position in the sequence as
another character.

Position-specific scoring matrices: A PSSM is defined as a table that contains probability


information of amino acids or nucleotides at each position of an ungapped multiple sequence
alignment. The matrix resembles the substitution matrices, but is more complex in that it
contains positional information of the alignment. In such a table, the rows represent residue
positions of a particular multiple alignment and the columns represent the names of residues
or vice versa (Fig. 1).
Fig. 1. Example of construction of a PSSM from a multiple alignment of nucleotide sequences. The process
involves counting raw frequencies of each nucleotide at each column position, normalization of the frequencies
by dividing positional frequencies of each nucleotide by overall frequencies and converting the values to log odds
scores.

The values in the table represent log odds scores of the residues calculated from the multiple
alignment. To construct a matrix, raw frequencies of each residue at each column position from
a multiple alignment are first counted. The frequencies are normalized by dividing positional
frequencies of each residue by overall frequencies so that the scores are length and composition
independent. The ratios are then converted to log odds scores by taking the logarithm
(normally to the base of 2). In this way, the matrix values become log odds scores of residues
occurring at each alignment position. In this matrix, a positive score represents an identical or
similar residue match; a negative score represents a non-conserved sequence match.

This constructed matrix can be considered a distilled representation for the entire group of
related sequences, providing a quantitative description of the degree of sequence conservation
at each position of a multiple alignment. The probabilistic model can then be used like a single
sequence for database searching and alignment or can be used to test how well a particular
target sequence fits into the sequence group. For example, given the matrix shown in Figure
1,which is derived from a DNA multiple alignment, one can ask the question, how well does
the new sequence AACTCG fit into the matrix? To answer the question, the probability values
of the sequence at respective positions of the matrix can be added up to produce the sum of the
scores (Fig. 2).

Fig. 2. Example of calculation of how well a new sequence fits into the PSSM produced in Fig. 1. The matching
positions for the new sequence AACTCG are circled in the matrix.

In this case, the total match score for the sequence is 6.33. Because the matrix
values have been taken as logarithms to the base of 2, the score can be interpreted as the
sequence fitting the matrix with odds of 2^6.33, i.e. about 80 times more likely than by random
chance. Consequently, the new sequence can be confidently classified as a member of the
sequence family.
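
The construction and scoring steps described above can be sketched in a few lines of Python. The toy alignment, the equal background frequencies of 0.25, and the pseudocount are illustrative assumptions, not values from the text.

```python
# A sketch of PSSM construction: count positional frequencies, normalize by
# background frequencies, convert to log2 odds, then score a new sequence by
# summing its per-position values (the Fig. 2 calculation).
import math

alignment = ["ACGTCG", "AAGTCG", "ACCTCG", "ACGTGG"]  # hypothetical aligned DNA
bases, n = "ACGT", len(alignment)

pssm = []
for col in zip(*alignment):                     # walk the alignment column-wise
    scores = {}
    for b in bases:
        freq = (col.count(b) + 0.25) / (n + 1)  # pseudocounted positional frequency
        scores[b] = math.log2(freq / 0.25)      # log odds vs. 0.25 background
    pssm.append(scores)

def score(seq):
    # Sum the log odds of each residue at its position
    return sum(pssm[i][b] for i, b in enumerate(seq))

print(round(score("ACGTCG"), 2))
```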

The probability values in a PSSM depend on the number of sequences used to compile the
matrix. Because the matrix is often constructed from the alignment of an insufficient number
of closely related sequences, to increase the predictive power of the model, a weighting scheme
similar to the one used in the Clustal algorithm is used that down weights overrepresented,
closely related sequences and upweights underrepresented and divergent ones, so that more
divergent sequences can be included. Application of such a weighting scheme makes the matrix
less biased and able to detect more distantly related sequences.
12. Why sequence comparison is important?
Answer: Sequence alignment is the procedure of comparing two (pair-wise alignment) or more
multiple sequences by searching for a series of individual characters or patterns that are in the
same order in the sequences.
 Pair-wise alignment: compare two sequences
 Multiple sequence alignment: compare more than two sequences

Pairwise alignment can be either global or local


In global alignment, an attempt is made to align the entire sequence.
 If two sequences have approximately the same length and are quite similar, they are
suitable for the global alignment.
 Suitable for aligning two closely related sequences

Local alignment concentrates on finding stretches of sequences with high level of matches
 Finds local regions with the highest level of similarity between the two sequences and
aligns these regions without considering the alignment of rest of the sequence regions
 Suitable for aligning more divergent sequences
 Used for finding out conserved patterns in DNA or protein sequences

By finding similarities between sequences


1. Scientists can infer the function of newly sequenced genes by aligning the newly
sequenced genes with sequences already in the database
2. Predict new members of gene families
3. Discovering evolutionary relationships or reconstruction of phylogeny
-To find whether two (or more) genes or proteins are evolutionarily related to each
other in closely related species
4. It can be used to predict the location and function of protein-coding and transcription-
regulation regions in genomic DNA. Regulatory regions in the genome are often
conserved; the presence of such conserved regions therefore points to the
regulatory sites in newly sequenced genes.
5. To find structurally or functionally similar regions within proteins

13. Align the following two sequences using the Needleman-Wunsch method of pairwise
sequence alignment, using -2 as the gap penalty, -3 as the mismatch penalty, and 2 as
the score for a match.
ACTGATTCA
ACGCATCA
Answer:
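A minimal Needleman-Wunsch sketch that can be used to work this exercise is shown below, with the stated scoring (match = +2, mismatch = -3, gap = -2). The traceback recovers one optimal alignment; ties in the matrix may permit others.

```python
# Needleman-Wunsch global alignment with the exercise's scoring parameters.
def needleman_wunsch(a, b, match=2, mismatch=-3, gap=-2):
    n, m = len(a), len(b)
    # F[i][j] = best score aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # (mis)match
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Traceback to recover one optimal alignment
    A, B, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            A.append(a[i - 1]); B.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            A.append(a[i - 1]); B.append('-'); i -= 1
        else:
            A.append('-'); B.append(b[j - 1]); j -= 1
    return ''.join(reversed(A)), ''.join(reversed(B)), F[n][m]

print(needleman_wunsch("ACTGATTCA", "ACGCATCA"))
```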

14. Enlist sequence alignment programs.


Answer: For pairwise alignment:

 AlignMe – alignments for membrane protein sequences (protein; local and global;
M. Stamm, K. Khafizov, R. Staritzbichler, L.R. Forrest; 2013)
 ALLALIGN – for DNA, RNA and protein molecules up to 32 MB; aligns all sequences of
size K or greater; similar alignments are grouped together for analysis; automatic
repetitive-sequence filter (both; local; E. Wachtel; 2017)
 Bioconductor Biostrings::pairwiseAlignment – dynamic programming (both; local,
global and ends-free; P. Aboyoun; 2008)
 CUDAlign – DNA sequence alignment of unrestricted size on single or multiple GPUs
(nucleotide; local, semi-global, global; E. Sande; 2011-2015)
 DNADot – web-based dot-plot tool (nucleotide; global; R. Bowen; 1998)
 FEAST – posterior-based local extension with descriptive evolution model (nucleotide;
local; A. K. Hudek and D. G. Brown; 2010)
 G-PAS – GPU-based dynamic programming with backtracking (both; local, semi-global,
global; W. Frohmberg, M. Kierzynka et al.; 2011)
 GapMis – pairwise sequence alignment with one gap (both; semi-global; K. Frousios,
T. Flouri, C. S. Iliopoulos, K. Park, S. P. Pissis, G. Tischler; 2012)
 Genome Magician – ultra-fast local DNA sequence motif search and pairwise alignment
for NGS data (FASTA, FASTQ) (DNA; local, semi-global, global; D. Hepperle,
www.sequentix.de; 2020)
 LALIGN – multiple, non-overlapping local similarity, same algorithm as SIM (both;
local, non-overlapping; W. Pearson; 1991)
 NW-align – standard Needleman-Wunsch dynamic programming algorithm (protein;
global; Y. Zhang; 2012)
 mAlign – modelling alignment; models the information content of the sequences
(nucleotide; local and global; D. Powell, L. Allison and T. I. Dix; 2004)
 matcher – Waterman-Eggert local alignment, based on LALIGN (both; local;
I. Longden, modified from W. Pearson; 1999)
 NW – Needleman-Wunsch dynamic programming (both; global; A.C.R. Martin;
1990-2015)
 PyMOL – the "align" command aligns sequences and applies the alignment to structures
(protein; global by selection; W. L. DeLano; 2007)
 SEQALN – various dynamic programming methods (both; local or global;
M.S. Waterman and P. Hardy; 1996)
 stretcher – memory-optimized Needleman-Wunsch dynamic programming (both; global;
I. Longden, modified from G. Myers and W. Miller; 1999)
 tranalign – aligns nucleic acid sequences given a protein alignment (nucleotide;
G. Williams, modified from B. Pearson; 2002)
 water – Smith-Waterman dynamic programming (both; local; A. Bleasby; 1999)
 wordmatch – k-tuple pairwise match (both; I. Longden; 1998)

For multiple sequence alignment:

 ALLALIGN – for DNA, RNA and protein molecules up to 32 MB; aligns all sequences of
size K or greater, as MSA or within a single molecule; similar alignments are grouped
together for analysis; automatic repetitive-sequence filter (both; local; E. Wachtel;
2017; free)
 ClustalW – progressive alignment (both; local or global; Thompson et al.; 1994; free,
LGPL)
 CodonCode Aligner – multi-alignment; ClustalW and Phrap support (nucleotides; local
or global; P. Richterich et al.; 2003, latest version 2009)
 DECIPHER – progressive-iterative alignment (both; global; Erik S. Wright; 2014; free,
GPL)
 DNA Baser Sequence Assembler – multi-alignment; fully automatic sequence alignment;
automatic ambiguity correction; internal base caller; command-line sequence alignment
(nucleotides; local or global; Heracle BioSoft SRL; 2006, latest version 2018;
commercial, some modules freeware)
 FAMSA – progressive alignment for extremely large protein families (hundreds of
thousands of members) (protein; global; Deorowicz et al.; 2016; free, GPL 3)
 Geneious – progressive-iterative alignment; ClustalW plugin (both; local or global;
A.J. Drummond et al.; 2005, latest version 2017)
 MAFFT – progressive-iterative alignment (both; local or global; K. Katoh et al.; 2005;
free, BSD)
 MegAlign Pro (Lasergene Molecular Biology) – aligns DNA, RNA, protein, or DNA +
protein sequences via pairwise and multiple sequence alignment algorithms including
MUSCLE, Mauve, MAFFT, Clustal Omega, Jotun Hein, Wilbur-Lipman, Martinez,
Needleman-Wunsch, Lipman-Pearson and dot-plot analysis (both; local or global;
DNASTAR; 1993-2016)
 Multi-LAGAN – progressive dynamic programming alignment (both; global;
M. Brudno et al.; 2003)
 MUSCLE – progressive-iterative alignment (both; local or global; R. Edgar; 2004)
 Opal – progressive-iterative alignment (both; local or global; T. Wheeler and
J. Kececioglu; 2007, latest stable 2013, latest beta 2016)
 Probalign – probabilistic/consistency with partition function probabilities (protein;
global; Roshan and Livesay; 2006; free, public domain)
 ProbCons – probabilistic/consistency (protein; local or global; C. Do et al.; 2005; free,
public domain)
 Stemloc – multiple alignment and secondary structure prediction (RNA; local or global;
I. Holmes; 2005; free, GPL 3, part of DART)
 T-Coffee – more sensitive progressive alignment (both; local or global; C. Notredame
et al.; 2000, newest version 2008; free, GPL 2)
 UGENE – supports multiple alignment with MUSCLE, KAlign, Clustal and MAFFT
plugins (both; local or global; UGENE team; 2010, newest version 2020; free, GPL 2)
 VectorFriends – VectorFriends Aligner, MUSCLE plugin, and ClustalW plugin (both;
local or global; BioFriends team; 2013; proprietary, freeware for academic use)
15. Describe various forms of BLAST used in bioinformatics.
Answer: BLAST (basic local alignment search tool) compares a query DNA or protein
sequence with other sequences in a database; for example, searching against the human
genome reveals the closest human gene. There are several BLAST programs: for a protein
database use the blastp and blastx programs, and for a nucleotide database use the blastn,
tblastn, and tblastx programs.

Types of BLAST
Nucleotide-nucleotide BLAST (blastn) – This program, given a DNA query, returns the most
similar DNA sequences from the DNA database that the user specifies.

Protein-protein BLAST (blastp) – This program, given a protein query, returns the most
similar protein sequences from the protein database that the user specifies.

Position-Specific Iterative BLAST (PSI- BLAST) (blastpgp) – This program is used to find
distant relatives of a protein.

Nucleotide 6-frame translation-protein (blastx) - This program compares the six-frame


conceptual translation products of a nucleotide query sequence (both strands) against a protein
sequence database.

Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx) - The purpose of


tblastx is to find very distant relationships between nucleotide sequences.

Protein-nucleotide 6-frame translation (tblastn) - This program compares a protein query


against all six reading frames of a nucleotide sequence database.

Large numbers of query sequences (megablast) - When comparing large numbers of input
sequences via the command-line BLAST, “megablast” is much faster than running BLAST
multiple times.

Of these programs, BLASTn and BLASTp are the most commonly used because they use direct
comparisons, and do not require translations. However, since protein sequences are better
conserved evolutionarily than nucleotide sequences, tBLASTn, tBLASTx, and BLASTx,
produce more reliable and accurate results when dealing with coding DNA.

BLAST Program – Further details

 nucleotide blast (blastn): compares a nucleotide query sequence against a nucleotide
sequence database.
 protein blast (blastp): compares an amino acid query sequence against a protein
sequence database.
 blastx: compares a nucleotide query sequence translated in all reading frames against a
protein sequence database. You could use this option to find potential translation
products of an unknown nucleotide sequence.
 tblastn: compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames.
 tblastx: compares the six-frame translations of a nucleotide query sequence against the
six-frame translations of a nucleotide sequence database. Please note that the tblastx
program cannot be used with the nr database on the BLAST Web page because it is
computationally intensive.

16. Compare the two given sequences using a dot matrix.


Seq1: GATTCTATCTAACTA and Seq 2: GTTCTATTCTAAC
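
Answer: A dot matrix places a mark wherever a residue of Seq1 (rows) equals a residue of Seq2 (columns); diagonal runs of marks indicate matching stretches. A minimal sketch for these two sequences:

```python
# Print a simple dot plot: '*' where the row and column residues are identical.
seq1 = "GATTCTATCTAACTA"
seq2 = "GTTCTATTCTAAC"

print("  " + " ".join(seq2))
for r in seq1:
    row = ["*" if r == c else "." for c in seq2]
    print(r + " " + " ".join(row))
```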

17. Describe Needleman-Wunsch algorithm and Smith-Waterman algorithm.


Answer:
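In brief, Needleman-Wunsch computes a global, end-to-end alignment by dynamic programming (see the sketch under Question 13 above), while Smith-Waterman computes a local alignment by flooring every matrix cell at zero and taking the best-scoring cell anywhere in the matrix. A compact scoring sketch of the Smith-Waterman variant follows; the parameter values are illustrative.

```python
# Smith-Waterman local alignment score: like Needleman-Wunsch, but cells never
# go below 0 (a new local alignment can restart anywhere) and the answer is the
# best cell in the whole matrix, not the bottom-right corner.
def smith_waterman_score(a, b, match=2, mismatch=-3, gap=-2):
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,                      # restart: never go negative
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best

print(smith_waterman_score("ACTGATTCA", "ACGCATCA"))
```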

18. Consider the following two alignments and compute the best sequence match:
Alignment 1:  V I T K L G T C V G S      Alignment 2:  V I T K L G T C V G S
              V - T K - G T C V - S                    V I T - - - T C V G S
Match = 3, Gap open penalty = -4, Gap extension penalty = -1
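
A small scorer for already-aligned sequences, using the stated parameters, shows how the two candidates compare. Note two assumptions: the dash placement in Alignment 1 above is reconstructed from the residue strings, and a mismatch score of 0 is used since neither alignment contains a mismatched pair.

```python
# Score a given (already aligned) pair of strings under affine gap penalties:
# each gap costs gap_open on its first column and gap_ext on each extension.
def alignment_score(a, b, match=3, gap_open=-4, gap_ext=-1, mismatch=0):
    score, in_gap = 0, False
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            score += gap_ext if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

print(alignment_score("VITKLGTCVGS", "V-TK-GTCV-S"))  # three length-1 gaps: 12
print(alignment_score("VITKLGTCVGS", "VIT---TCVGS"))  # one length-3 gap: 18
```

Because one long gap is penalized less than several short ones, Alignment 2 (score 18) is the better match under this scheme.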
Unit 3

Scoring Matrices: Basic concept of a scoring matrix, Similarity and distance matrix,
Substitution matrices: Matrices for nucleic acid and proteins sequences, PAM and
BLOSUM series, Principles based on which these matrices are derived and Gap Penalty;
Predictive Method using Nucleotide Sequence: Introduction, Marking repetitive DNA,
Database search, Codon bias detection, detecting functional site in DNA.

1. Explain gap penalty parameters.


Answer: Gap penalties are used during sequence alignment. Gap penalties contribute to
the overall score of alignments, and therefore, the size of the gap penalty relative to the
entries in the similarity matrix affects the alignment that is finally selected. Selecting a
higher gap penalty makes gaps more costly, so less favourable character pairings are
aligned in preference to creating gaps.

Constant gap penalty


Constant gap penalties are the simplest type of gap penalty. The only parameter, d, is added
to the alignment score when the gap is first opened. This means that any gap receives the
same penalty, whatever its size.

Linear gap penalty


Linear gap penalties have a single parameter, d, which is a penalty per unit length of gap.
This is almost always negative, so that an alignment with fewer gaps is favoured over an
alignment with more gaps. Under a linear gap penalty, the overall penalty for one large gap
is the same as for many small gaps.

Affine gap penalty


Affine gap penalties attempt to overcome this problem. In biological sequences, for
example, it is much more likely that one big gap of length 10 occurs in one sequence, due
to a single insertion or deletion event, than it is that 10 small gaps of length 1 are made.
Therefore, affine gap penalties are length dependent (unlike linear gap penalties which are
length independent) and use a gap opening penalty, o, and a gap extension penalty, e. A
gap of length l is then given a penalty o + (l-1)e. So that gaps are discouraged, o and e are
almost always negative. Furthermore, because a few large gaps are better than many small
gaps, e is almost always smaller than o to encourage gap extension rather than gap
introduction.
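
The affine formula is easy to verify numerically; in the sketch below the values o = -10 and e = -0.5 are illustrative choices, not prescribed by the text.

```python
# Affine gap penalty: a gap of length l costs o + (l - 1) * e,
# where o is the (negative) opening penalty and e the (negative) extension penalty.
def affine_gap_penalty(length: int, o: float = -10.0, e: float = -0.5) -> float:
    return o + (length - 1) * e

print(affine_gap_penalty(1))   # -10.0 : opening cost only
print(affine_gap_penalty(10))  # -14.5 : far cheaper than ten length-1 gaps (-100.0)
```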

2. Describe similarity matrix.


Answer: A similarity matrix is a matrix of scores which express the similarity between two
data points. Similarity matrices are strongly related to their counterparts, distance
matrices and substitution matrices.

Similarity matrices are used in sequence alignment. Higher scores are given to more-similar
characters, and lower or negative scores for dissimilar characters.
Nucleotide similarity matrices are used to align nucleic acid sequences. Because there are
only four nucleotides commonly found in DNA (Adenine (A), Cytosine (C), Guanine (G)
and Thymine (T)), nucleotide similarity matrices are much simpler than protein similarity
matrices. For example, a simple matrix will assign identical bases a score of +1 and non-
identical bases a score of −1. A more complicated matrix would give a higher score to
transitions (changes from a pyrimidine such as C or T to another pyrimidine, or from a
purine such as A or G to another purine) than to transversions (from a pyrimidine to a purine
or vice versa). The match/mismatch ratio of the matrix sets the target evolutionary distance
(States et al. 1991 METHODS - A companion to Methods in Enzymology 3:66-70); the
+1/−3 DNA matrix used by BLASTN is best suited for finding matches between sequences
that are 99% identical; a +1/−1 (or +4/−4) matrix is much more sensitive as it is optimal
for matches between sequences that are about 70% identical.
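
As an illustration of such a transition-aware scheme, the sketch below scores a pair of bases; the particular values (+1 identity, -1 transition, -2 transversion) are assumptions chosen only to show the idea:

```python
# Purine <-> purine or pyrimidine <-> pyrimidine changes are transitions;
# changes across the two classes are transversions.
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def score_pair(a, b, match=1, transition=-1, transversion=-2):
    if a == b:
        return match
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return transition if same_class else transversion

print(score_pair("A", "A"))  # 1  (identity)
print(score_pair("A", "G"))  # -1 (transition)
print(score_pair("A", "C"))  # -2 (transversion)
```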

Amino acid similarity matrices are more complicated, because there are 20 amino acids
coded for by the genetic code. Therefore, the similarity matrix for amino acids contains 400
entries (although it is usually symmetric). The first approach scored all amino acid changes
equally. A later refinement was to determine amino acid similarities based on how many
base changes were required to change a codon to code for that amino acid. This model is
better, but it doesn't take into account the selective pressure of amino acid changes. Better
models took into account the chemical properties of amino acids.

One approach has been to empirically generate the similarity matrices. The Dayhoff method
used phylogenetic trees and sequences taken from species on the tree. This approach has
given rise to the PAM series of matrices. PAM matrices are labelled by the number of
accepted point mutations (amino acid substitutions) per 100 residues. While the PAM matrices benefit
from having a well understood evolutionary model, they are most useful at short
evolutionary distances (PAM10 - PAM120). At long evolutionary distances, for example
PAM250 or 20% identity, it has been shown that the BLOSUM matrices are much more
effective.

The BLOSUM series were generated by comparing blocks of conserved, aligned segments from
a number of divergent sequences. Each BLOSUM matrix is labelled by the percentage identity
threshold used to cluster the sequences it was built from, so a lower BLOSUM number
corresponds to a higher PAM number, i.e., to a greater evolutionary distance.

3. Discuss the biological significance of multiple sequence alignment.


Answer: Multiple sequence alignment (MSA) has assumed a key role in comparative
structure and function analysis of biological sequences. It often leads to fundamental
biological insight into sequence-structure-function relationships of nucleotide or protein
sequence families. MSA may refer to the process or the result of sequence alignment of
three or more biological sequences, generally protein, DNA, or RNA. In many cases, the
input set of query sequences is assumed to have an evolutionary relationship: the sequences
are descended from a common ancestor and share a linkage. From the resulting MSA,
sequence homology can be inferred and phylogenetic analysis can be conducted to assess
the sequences' shared evolutionary origins. Visual depictions of an alignment
illustrate mutation events such as point mutations (single amino
acid or nucleotide changes), which appear as differing characters in a single alignment column,
and insertion or deletion mutations (indels or gaps), which appear as hyphens in one or more
of the sequences in the alignment. Multiple sequence alignment is often used to assess
sequence conservation of protein domains, tertiary and secondary structures, and even
individual amino acids or nucleotides.

Computational algorithms are used to produce and analyse the MSAs due to the difficulty
and intractability of manually processing the sequences given their biologically relevant
length. MSAs require more sophisticated methodologies than pairwise alignment because
they are more computationally complex. Most multiple sequence alignment programs
use heuristic methods rather than global optimization because identifying the optimal
alignment between more than a few sequences of moderate length is prohibitively
computationally expensive. On the other hand, heuristic methods generally fail to give
guarantees on the solution quality, with heuristic solutions shown to be often far below the
optimal solution on benchmark instances.

4. Describe Gap Penalty. Why are gap penalties introduced during sequence
alignment? What are affine gap penalties?
Answer: A Gap penalty is a method of scoring alignments of two or more sequences. Gap
penalties are used during sequence alignment. Gap penalties contribute to the overall score
of alignments, and therefore, the size of the gap penalty relative to the entries in
the similarity matrix affects the alignment that is finally selected. Selecting a higher gap
penalty forces less favourable characters to be aligned so that fewer gaps are
created. When aligning sequences, introducing gaps in the sequences can allow an
as possible. When aligning sequences, introducing gaps in the sequences can allow an
alignment algorithm to match more terms than a gap-less alignment can. However,
minimizing gaps in an alignment is important to create a useful alignment. Too many gaps
can cause an alignment to become meaningless. Gap penalties are used to adjust alignment
scores based on the number and length of gaps. The five main types of gap penalties are
constant, linear, affine, convex, and profile based.

Gap penalties increase the quality of an alignment by ensuring that non-homologous regions
are not aligned: permitting the insertion of arbitrarily many gaps can lead to high-scoring
alignments of non-homologous sequences, so penalizing gaps forces alignments to have
relatively few of them. An affine gap penalty has both constant and proportional contributions.
The constant part is independent of the length of the gap, while the proportional part grows
with gap length. Thus an affine gap penalty includes a penalty for opening a gap as well as
for extending it.

5. Give the example of functional sites in DNA.


Answer: DNA binding sites are a type of binding site found in DNA where other
molecules may bind. DNA binding sites are distinct from other binding sites in that (1)
they are part of a DNA sequence (e.g. a genome) and (2) they are bound by DNA-binding
proteins. DNA binding sites are often associated with specialized proteins known
as transcription factors, and are thus linked to transcriptional regulation. The sum of DNA
binding sites of a specific transcription factor is referred to as its cistrome. DNA binding
sites also encompass the targets of other proteins, like restriction enzymes, site-specific
recombinases (see site-specific recombination) and methyltransferases.
DNA binding sites can be thus defined as short DNA sequences (typically 4 to 30 base
pairs long, but up to 200 bp for recombination sites) that are specifically bound by one or
more DNA-binding proteins or protein complexes. It has been reported that some binding
sites have potential to undergo fast evolutionary change.

6. Describe significance of databases search.


Answer: Databases are online collections of resources that you can search to find information.
They may cover a particular subject area or cover a range of subjects. Most databases:
 have a peer reviewed or scholarly material filter to ensure you get reliable, authoritative
information
 offer advanced search features that allow you to focus your search.
Database searches can also be used:
 To find out if a new DNA sequence shares similarities with sequences already deposited
in the databanks.
 To find proteins homologous to a putative coding ORF.
 To find similar non‐coding DNA stretches in the database, (for example: repeat
elements, regulatory sequences).
 To locate false priming sites for a set of PCR oligonucleotides.

7. Discuss Codon bias problem.


Answer: A similar genetic code is used by most organisms on Earth, but different organisms
have different preferences for the codons they use to encode specific amino acids. This is
possible because there are 4 bases (A, T, C, and G) and 3 positions in each codon, giving 64
possible codons but only 20 amino acids and a stop signal to encode. The result is redundancy:
most amino acids are encoded by more than one codon.
Evolutionary constraints have molded which codons are used preferentially in which organisms
- organisms have codon usage bias. Published codon tables show which codons
encode which amino acids. With such simple rules, you might think
it’s easy to come up with a workable DNA sequence to encode your peptide of interest and
produce that peptide in your organism of choice. Unfortunately, codon preferences make it so
you cannot choose among the possible codons at random and expect your sequence to express
well in any organism.
The reasons for varied codon preferences among organisms aren’t completely understood, but
some possible reasons include:
1. Metabolic pressures - it takes cellular resources to produce tRNAs that recognize
different codons, modify the tRNAs correctly, and charge the tRNAs with the
appropriate amino acids. If an organism uses only a subset of codons, it only needs to
produce a subset of charged tRNAs and therefore may need fewer resources for the
entire translation process. For example, during high growth rate conditions, E.
coli preferentially upregulates production of tRNAs that recognize codons found in
highly expressed genes.
2. Controlling gene expression through gene sequence - Proteins that are encoded by
codons with low abundance or poorly charged tRNAs may be produced at a lower rate
than proteins encoded by highly abundant, charged tRNAs.
3. Protein folding - If a protein is encoded by a mixture of codons with highly and poorly
charged tRNAs, different regions of the protein may be translated at different rates. The
ribosome will move quickly along regions calling for abundant, charged tRNAs but will
stall at regions calling for low abundance, poorly charged tRNAs. When the ribosome
stalls, this may give the swiftly translated regions a chance to fold properly.
4. Adaptation to changing conditions - Organisms often need to express genes at different
levels under different conditions. With varied codon usage, an organism can change
which proteins are highly expressed and which are poorly expressed by producing and
charging specific tRNA pools. For example, tRNAs used in genes encoding amino acid
biosynthetic enzymes may be preferentially charged during amino acid starvation,
resulting in higher production of those biosynthetic enzymes.
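
As a small illustration of how codon usage is measured, the sketch below counts codon frequencies in a toy in-frame coding sequence; the sequence is hypothetical and the code assumes the reading frame starts at position 0:

```python
from collections import Counter

def codon_usage(cds):
    # Split the sequence into codons (triplets), ignoring any
    # trailing bases that do not complete a codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Hypothetical in-frame sequence: ATG GCT GCA GCT TAA
print(codon_usage("ATGGCTGCAGCTTAA"))
# {'ATG': 0.2, 'GCT': 0.4, 'GCA': 0.2, 'TAA': 0.2}
```

Comparing such frequency tables between a gene and its host organism is the basis of codon bias detection and codon optimization.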

8. Differentiate RefSeq and GenBank Databases.


Answer: The RefSeq collection is derived from the primary submissions available in
GenBank. GenBank is a redundant archival database that represents sequence
information generated at different times, and may represent several alternate views of
the protein, names, or other information.

RefSeq is limited to major organisms for which sufficient data are available (121,461
distinct "named" organisms as of July 2022), while GenBank includes sequences for
any organism submitted (approximately 504,000 formally described species).

9. Consider the following two sequences and produce an optimal alignment using Dynamic
Programming. Seq1: VITKLGTCVGS and Seq2: VTKGTCVS
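
The alignment itself is left as an exercise; a minimal Needleman-Wunsch sketch that can be applied to Seq1 and Seq2 is given below. The scoring scheme (match +1, mismatch -1, linear gap -2) is an assumption, since the question specifies none, and different parameters may yield different optimal alignments:

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Traceback to recover one optimal alignment.
    a, b, i, j = "", "", n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a, b, i, j = s[i - 1] + a, t[j - 1] + b, i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a, b, i = s[i - 1] + a, "-" + b, i - 1
        else:
            a, b, j = "-" + a, t[j - 1] + b, j - 1
    return F[n][m], a, b

print(needleman_wunsch("VITKLGTCVGS", "VTKGTCVS"))
```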

10. Differentiate PAM and BLOSUM. Which PAM matrix would you use to align
highly similar sequences?
Answer: Substitution matrices are used to score aligned positions in a sequence
alignment procedure, usually of amino acid or nucleotide sequences. In addition to BLOSUM
matrices, a previously developed scoring matrix, the PAM series, can be used for sequence
similarity analysis. The two serve the same purpose but use differing methodologies:
BLOSUM looks directly at substitutions in conserved blocks (motifs) of related sequences,
while PAM matrices extrapolate evolutionary information from alignments of closely related
sequences. Since PAM and BLOSUM derive the same kind of scoring information by different
methods, the two can be compared, but a PAM100 does not equal a BLOSUM100. Because low
PAM numbers correspond to short evolutionary distances, a low-numbered matrix (e.g.,
PAM1-PAM40) is the appropriate choice for aligning highly similar sequences.

The relationship between PAM and BLOSUM


PAM: PAM matrices are used to score alignments between closely related protein sequences.
BLOSUM: BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.

PAM: To compare distantly related proteins, PAM matrices with high numbers are created.
BLOSUM: To compare distantly related proteins, BLOSUM matrices with low numbers are created.

PAM: Based on global alignments.
BLOSUM: Based on local alignments.

PAM: Alignments have higher similarity than BLOSUM alignments.
BLOSUM: Alignments have lower similarity than PAM alignments.

PAM: Mutations in global alignments are very significant.
BLOSUM: Based on highly conserved stretches of alignments.

PAM: Higher numbers in the PAM matrix naming denote greater evolutionary distance.
BLOSUM: Higher numbers in the BLOSUM matrix naming denote higher sequence similarity and smaller evolutionary distance.

PAM: Example: PAM 250 is used for more distant sequences than PAM 120.
BLOSUM: Example: BLOSUM 80 is used for more closely related sequences than BLOSUM 62.

The differences between PAM and BLOSUM


PAM: Based on global alignments of closely related proteins.
BLOSUM: Based on local alignments.

PAM: PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence, corresponding to 99% sequence identity.
BLOSUM: BLOSUM 62 is a matrix calculated from comparisons of sequences with a pairwise identity of no more than 62%.

PAM: Other PAM matrices are extrapolated from PAM1.
BLOSUM: Based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

PAM: Higher numbers in the matrix naming scheme denote larger evolutionary distance.
BLOSUM: Larger numbers in the matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance.

11. Compute the matching position of nucleotides using PSSM for the given
sequences:
Seq1: ATGTCG, Seq2: AAGACT, Seq3: TACTCA, Seq4: CGGAGG, and Seq5:
AACCTG

Figure 1. Example of construction of a PSSM from a multiple alignment of nucleotide sequences. The process
involves counting raw frequencies of each nucleotide at each column position, normalization of the frequencies
by dividing positional frequencies of each nucleotide by overall frequencies and converting the values to log odds
scores.
Figure 2. Example of calculation of how well a new sequence fits into the PSSM produced in Figure 1. The
matching positions for the new sequence AACTCG are circled in the matrix.
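
A minimal sketch of the construction described in Figure 1, applied to the five sequences above; the uniform background frequency of 0.25 and the small pseudocount (added so that unobserved bases do not give log of zero) are assumptions, and textbook treatments may normalize slightly differently:

```python
import math

seqs = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]
bases = "ACGT"
L, N = len(seqs[0]), len(seqs)

# Build the PSSM: per-position log-odds of each base vs. the background.
pssm = []
for i in range(L):
    column = [s[i] for s in seqs]
    scores = {}
    for b in bases:
        freq = (column.count(b) + 0.25) / (N + 1)  # pseudocount of 0.25
        scores[b] = math.log2(freq / 0.25)         # log-odds vs. 0.25
    pssm.append(scores)

def fit(seq):
    # Sum the matching positions: how well seq fits the motif.
    return sum(pssm[i][b] for i, b in enumerate(seq))

print(round(fit("AACTCG"), 2))
```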

Unit 4

Phylogenetics: Phylogeny and concepts in molecular evolution; nature of data used in
taxonomy and phylogeny; definition and description of Phylogenetic trees and various types
of trees; Different methods of Phylogenetic tree construction: UPGMA and Fitch-
Margoliash Algorithm; case studies in phylogenetic sequence analysis.

1. Discuss concept in molecular evolution of phylogeny.


Answer: Molecular phylogenetics is the study of evolutionary relationships among
organisms using molecular sequence data. The aim here is to introduce the
important terminology and general concepts of tree reconstruction for biologists who lack a
strong background in the field of molecular evolution. Some modern phylogenetic programs
are easy to use because of their user-friendly interfaces, but understanding the phylogenetic
algorithms and substitution models, which are based on advanced statistics, is still important
for correct analysis and interpretation. Briefly, there are five general steps in
carrying out a phylogenetic analysis: (1) sequence data preparation, (2) sequence alignment,
(3) choosing a phylogenetic reconstruction method, (4) identification of the best tree, and
(5) evaluating the tree. These concepts enable biologists to grasp the basic ideas
behind phylogenetic analysis and provide a sound basis for discussions with expert
phylogeneticists.

2. Explain Cladogram and Dendrogram.


Answer: A cladogram is a diagrammatic representation which shows the relationship of
the closely related organisms. It is a type of a phylogenetic tree. But it only shows the
relationships between clades descending from a common ancestor. As an example, a cladogram
shows that humans are more closely related to chimpanzees than to gorillas, but it does not
show evolutionary time or the exact distance from the common ancestor.

Cladogram is a tree-like diagram which is drawn using lines. The nodes of a cladogram
represent the splitting of two groups from a common ancestor. Clades are summarized at
the ends of the lines and the members of a particular clade share similar characteristics.
Clades are built using molecular differences instead of morphological characteristics.
However, cladograms can be constructed using the correct morphological and behavioral
data as well.
Figure. A Primate Cladogram

A dendrogram, or phylogenetic tree, is a branching diagram or “tree” showing the
evolutionary history between biological species or other entities based on their genetic
characteristics. Species or entities joined together by nodes represent descendants from a
common ancestor and are more similar genetically. This figure shows a hypothetical example
of a rooted dendrogram, wherein the horizontal position of individuals represents the genetic
distance from a specific progenitor. With the advancement of DNA sequencing technologies,
phylogenetic trees have been used widely in infectious disease control to depict the genetic
similarities and differences between strains and variants of a certain disease pathogen.
Knowing whether infectious diseases occurring in different areas are from the same strain
provides key information on the source of infection and how the disease may be transmitted.
Interactive features of these visualizations may include the ability to collapse or color and label
branches.

Figure. A dendrogram

3. Explain Phylogenetic analysis.


Answer: Phylogenetic analysis provides an in-depth understanding of how species evolve
through genetic changes. Using phylogenetics, scientists can evaluate the path that connects
a present-day organism with its ancestral origin, as well as can predict the genetic divergence
that may occur in the future.
Phylogenetics has many applications in medical and biological fields, including forensic
science, conservation biology, epidemiology, drug discovery and drug design, prediction of
protein structure and function, and gene function prediction.

A more accurate estimation of the evolutionary relationship among species is now possible
in a molecular phylogenetic analysis using gene sequencing data. Also, the Linnaean
classification (based on relatedness in obvious physical traits) of newly evolved species can
be done using molecular phylogenetic analysis.

Regarding public health applications, molecular phylogenetic analysis can be employed to
gather information about pathogen outbreaks. A possible source of pathogen transmission
can be investigated by analysing the epidemiological linkage between genetic sequences of
a pathogen, such as HIV.

Moreover, phylogenetics can be used to evaluate the reciprocal evolutionary interaction
between microorganisms, as well as to identify mechanisms (horizontal gene transfer)
responsible for the rapid adaptation of pathogens in an ever-changing host
microenvironment.

Phylogenetic analysis can be useful in comparative genomics, which studies the relationship
between genomes of different species. In this context, one major application is gene
prediction or gene finding, which means locating specific genetic regions along a genome.

Phylogenetic screening of pharmacologically related species can help identify closely
related members of a species with pharmacological significance.

In microbiology, phylogenetic analysis can be applied to identify and classify various
microorganisms, including bacteria.

4. Explain the basic difference between taxonomy and phylogeny.


Answer: Taxonomy and phylogeny are two concepts involved in the classification of
organisms. Taxonomy is a branch of biology that concerns the naming and classifying
organisms based on their similarities and dissimilarities in their characteristics. Phylogeny
is the branch of science which concerns the evolutionary relationship of a species or a group
of species with a common ancestor. Thus, the key difference between taxonomy and
phylogeny is that taxonomy involves naming and classifying organisms while phylogeny
involves the evolution of the species or groups of species. Phylogeny is important in
taxonomy.

Difference Between Taxonomy and Phylogeny


Taxonomy vs Phylogeny

Definition:
Taxonomy is the field of biology that classifies living and extinct organisms according to a set of rules.
Phylogeny is the evolutionary history of a species or group of species.

Main Concern:
Taxonomy concerns naming and classifying organisms.
Phylogeny concerns evolutionary relationships of organisms.

Shared Evolutionary History:
Taxonomy does not reveal anything about the shared evolutionary history of organisms.
Phylogeny reveals the shared evolutionary history.

5. Describe molecular clock hypothesis.


Answer: A clock or a watch serves as a useful method for dating back past, current, and future
timescales about life events and history throughout the eras, but what about evolutionary
timescales? How are these timescales measured in terms of species divergence and evolution
on a molecular level?

A different type of clock is used, a Molecular Clock. Instead of measuring seconds, minutes,
and hours, the molecular clock measures the constant rate of change in an organism's genome
(DNA or protein sequences of a specific gene) over time. This constant rate of change occurs
randomly among different species of organisms such as animals, plants, fungi, and viruses. By
measuring these changes, scientists can then create phylogenetic trees representing a species
that evolved or diverged from another long ago.

The molecular clock hypothesis states that DNA and protein sequences evolve at a rate that is
relatively constant over time and among different organisms. The molecular clock is a
figurative term for a technique that uses the mutation rate of biomolecules to deduce the
time in prehistory when two or more life forms diverged. The biomolecular data used for such
calculations are usually nucleotide sequences for DNA, RNA, or amino acid sequences
for proteins. The benchmarks for determining the mutation rate are often fossil or
archaeological dates. The molecular clock was first tested in 1962 on the hemoglobin protein
variants of various animals and is commonly used in molecular evolution to estimate times
of speciation or radiation. It is sometimes called a gene clock or an evolutionary clock. The
notion of the existence of a so-called "molecular clock" was first attributed to Émile
Zuckerkandl and Linus Pauling who, in 1962, noticed that
the number of amino acid differences in hemoglobin between different lineages changes
roughly linearly with time, as estimated from fossil evidence. They generalized this
observation to assert that the rate of evolutionary change of any specified protein was
approximately constant over time and over different lineages (known as the molecular clock
hypothesis).

When an organism inherits genetic material from the previous generation, the change occurs
steadily, and the genes are said to be neutral. They are neither disadvantageous nor
advantageous, meaning they do not inhibit natural selection or fitness but are rather due
to genetic drift. Different genes containing different nucleotide substitutions are studied to
determine the rate at which the sequences of the genes have been evolving. This occurrence
happens over a timeframe of millions of years as the genes are passed down and altered from
one generation to the next.
The molecular clock measures the number of random mutations of an organism's gene (DNA
or protein sequences) at a relatively constant rate over a specific timeframe. It is calibrated with
fossil records and geological timescales. It measures how long-ago different organisms were
on Earth and when the divergence of a new species (animal, plant, virus, fungi) happened.
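
As a worked example of the clock relation: if each of two diverging lineages accumulates substitutions at rate r, the pairwise distance d between them grows at 2r per year, so the divergence time is t = d / (2r). The numbers below are illustrative assumptions, not measured rates:

```python
d = 0.02   # observed substitutions per site between two sequences
r = 1e-9   # assumed substitutions per site per year per lineage
t = d / (2 * r)
print(f"Estimated divergence time: {t:,.0f} years")  # 10,000,000 years
```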

6. Explain importance of phylogenetics.


Answer: Phylogenetics is important because it enriches our understanding of how genes,
genomes, species (and molecular sequences more generally) evolve. Through phylogenetics,
we learn not only how the sequences came to be the way they are today, but also general
principles that enable us to predict how they will change in the future. This is not only of
fundamental importance but also extremely useful for numerous applications (Figure).

Figure. Potential applications of phylogenetics

Classification: Phylogenetics based on sequence data provides us with more accurate
descriptions of patterns of relatedness than was available before the advent of molecular
sequencing. Phylogenetics now informs the Linnaean classification of new species.

Forensics: Phylogenetics is used to assess DNA evidence presented in court cases to inform
situations, e.g., where someone has committed a crime, when food is contaminated, or where
the father of a child is unknown.

Identifying the origin of pathogens: Molecular sequencing technologies and phylogenetic
approaches can be used to learn more about a new pathogen outbreak. This includes finding
out about which species the pathogen is related to and subsequently the likely source of
transmission. This can lead to new recommendations for public health policy.
Conservation: Phylogenetics can help to inform conservation policy when conservation
biologists have to make tough decisions about which species they should try to prevent from
becoming extinct.

Bioinformatics and computing: Many of the algorithms developed for phylogenetics have
been used to develop software in other fields.

7. Describe the relationships of phylogenetic analysis and sequences analysis.


Answer: In phylogenetic analysis, branching diagrams are made to represent the evolutionary
history or relationship between different species, organisms, or characteristics of an organism
(genes, proteins, organs, etc.) that are developed from a common ancestor.

The diagram is known as a phylogenetic tree. Phylogenetic analysis is important for gathering
information on biological diversity, genetic classifications, as well as learning developmental
events that occur during evolution.

With advancements in genetic sequencing techniques, phylogenetic analysis now involves the
sequence of a gene to understand the evolutionary relationships among species. DNA being the
hereditary material can now be sequenced easily, rapidly, and cost-effectively, and the data obtained
from genetic sequencing is very informative and specific.

Also, morphological estimates can be used to infer evolutionary developments, especially in cases
where genetic material is not available (fossils).

When two sequences found in two organisms are very similar, we assume that they have
derived from one common ancestor, and the sequence alignment reveals which positions are
conserved from that ancestral sequence. Several points are worth noting:
 Progressive multiple alignment of a group of sequences first aligns the most similar
pair and then adds the more distant pairs. The alignment is influenced by the "most
similar" pairs and arranged accordingly, but it does not always correctly represent the
evolutionary history of the changes that occurred. Not all phylogenetic methods work
this way.
 Most phylogenetic methods assume that each position in a sequence can change
independently from the other positions.
 Gaps in alignments represent mutations in sequences such as insertions, deletions, and
genetic rearrangements. Gaps are treated in various ways by phylogenetic methods;
most of them ignore gaps.
 Another approach to treating gaps is to use sequence similarity scores as the basis for
the phylogenetic analysis, instead of using the alignment itself and trying to decide
what happened at each position. Similarity scores based on scoring matrices (with gap
scores) are used by the DISTANCE methods.

8. Explain a phylogenetic tree.


Answer: Phylogenetic studies are useful for finding answers to different problems in
evolutionary biology such as the relationship between species and their origin, spread of
viral infections, migration patterns of species, etc. Advanced molecular biological
techniques have helped biologists to evaluate phylogenetic relationships between organisms
in relation to the evolutionary changes of the organisms. A phylogenetic tree is a diagram
which shows the relationship between organisms based on their characteristics, genetic
background, and evolutionary relationships. Compared to a cladogram, phylogenetic tree
has more value when discussing the relationships of organisms in a meaningful way with
respect to their ancestors and evolution. Phylogenetic tree is drawn like a branching tree
diagram in which branch length is proportional to the evolutionary distance, unlike a
cladogram.

Biologists analyze different characteristics of organisms using different analytical tools such
as parsimony, distance, likelihood and Bayesian methods, etc. They consider many
characteristics of organisms including morphological, anatomical, behavioral, biochemical,
molecular and fossil characteristics to construct phylogenetic trees.

Figure. A phylogenetic tree

9. Differentiate rooted tree and unrooted tree.


Answer: The main difference between rooted and unrooted phylogenetic tree is that
the rooted phylogenetic trees show the ancestry relationship, whereas the unrooted
phylogenetic trees only show the relatedness of organisms. Furthermore, each descendant
of the rooted phylogenetic tree has a most recent common ancestor, while unrooted
phylogenetic tree does not show the ancestral root. Overall, rooted and unrooted
phylogenetic trees are two types of phylogenetic trees that describe the relationships
among organisms.

A rooted phylogenetic tree is a type of phylogenetic tree that describes the ancestry of a
group of organisms. Importantly, it is a directed tree, starting from a unique node known as
the recent common ancestor. Basically, the roots of the phylogenetic tree describe this recent
common ancestor.
Figure 1: A Rooted Phylogenetic Tree
In practice, this recent common ancestor is located by including an outgroup, a distantly
related organism added to the group of organisms used to build the phylogenetic tree. The
root then serves as the parent of all organisms in the group.

Unrooted Phylogenetic Tree


The unrooted phylogenetic tree is a type of phylogenetic tree that only describes the relatedness
of a group of organisms. Importantly, the leaf nodes of this type of phylogenetic tree only show
relatedness, not the ancestry. Hence, it does not start with the recent common ancestor and does
not contain a root.

Figure 2: An Unrooted Phylogenetic Tree

However, the same data in the rooted phylogenetic tree can be used to generate an unrooted
phylogenetic tree as well. It is created by omitting the root.

Similarities Between Rooted and Unrooted Phylogenetic Tree


 Rooted and unrooted phylogenetic trees are two types of phylogenetic trees, describing
the relationships among organisms.
 Moreover, both types of trees give an idea about the relatedness of organisms.
 Additionally, both of them contain leaf nodes and internal nodes.
Difference Between Rooted and Unrooted Phylogenetic Tree
 A rooted tree is directed and has a unique root node representing the most recent common ancestor, so it shows ancestry; an unrooted tree has no root and shows only the relatedness of the organisms at its leaves.
 The same data can yield both: omitting the root of a rooted tree produces the corresponding unrooted tree.

10. Describe UPGMA method for phylogenetic analysis.


Answer: UPGMA stands for Unweighted Pair Group Method with Arithmetic Mean. It is the
simplest method of tree construction. It was originally developed for constructing taxonomic
phenograms, i.e., trees that reflect the phenotypic similarities between OTUs (operational
taxonomic units), but it can also be
used to construct phylogenetic trees if the rates of evolution are approximately constant among
the different lineages. For this purpose, the number of observed nucleotide or amino-acid
substitutions can be used. UPGMA employs a sequential clustering algorithm, in which local
topological relationships are identified in order of similarity, and the phylogenetic tree is built
in a stepwise manner.

The simplest clustering method is UPGMA, which builds a tree by a sequential clustering
method. Given a distance matrix, it starts by grouping two taxa with the smallest pairwise
distance in the distance matrix. A node is placed at the midpoint or half distance between them.
It then creates a reduced matrix by treating the new cluster as a single taxon. The distances
between this new composite taxon and all remaining taxa are calculated to create a reduced
matrix. The same grouping process is repeated and another newly reduced matrix is created.
The iteration continues until all taxa are placed on the tree. The last taxon added is considered
the outgroup producing a rooted tree.

The basic assumption of the UPGMA method is that all taxa evolve at a constant rate and that
they are equally distant from the root, implying that a molecular clock is in effect. However,
real data rarely meet this assumption. Thus, UPGMA often produces erroneous tree topologies.
However, owing to its fast speed of calculation, it has found extensive usage in clustering
analysis of DNA microarray data.
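
A minimal sketch of the clustering loop described above; the four-taxon distance matrix is invented for illustration, and real implementations track branch lengths on a proper tree structure rather than printing merge heights:

```python
def upgma(dist, labels):
    # dist maps ordered label pairs (a, b) to distances; clusters maps
    # each current cluster to the number of original taxa it contains.
    clusters = {lab: 1 for lab in labels}
    d = dict(dist)

    def get(a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    while len(clusters) > 1:
        # 1. Find the closest pair of clusters.
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: get(*p))
        new, na, nb = f"({a},{b})", clusters[a], clusters[b]
        print(f"merge {a} + {b} at height {get(a, b) / 2}")  # node at midpoint
        # 2. Size-weighted average distance to every remaining cluster.
        for c in clusters:
            if c not in (a, b):
                d[(new, c)] = (na * get(a, c) + nb * get(b, c)) / (na + nb)
        # 3. Replace the pair by the new composite cluster and repeat.
        del clusters[a], clusters[b]
        clusters[new] = na + nb
    return next(iter(clusters))

matrix = {("A", "B"): 4, ("A", "C"): 8, ("A", "D"): 8,
          ("B", "C"): 8, ("B", "D"): 8, ("C", "D"): 4}
print(upgma(matrix, ["A", "B", "C", "D"]))  # ((A,B),(C,D))
```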

11. Discuss Fitch-Margoliash algorithm for phylogenetic analysis.


Answer: Fitch-Margoliash algorithm for phylogenetic analysis is the first algorithm based on
least squares principle for phylogenetic tree reconstruction. It was developed by Walter Fitch
and Emanuel Margoliash in 1967. The evolutionary distances between the taxa are determined
by the Jukes–Cantor model when DNA sequences (instead of distances) of the same length are
entered.

The algorithm is based on optimality criteria that select the tree with a minimum amount of
residual (difference between actual and expected summed evolutionary distance). The
algorithm estimates the total branch length (distance) and clusters in accordance with taxa pair
in order to determine the unrooted tree with minimum distance.
1. The FM algorithm does not assume a constant rate of evolution, which is quite realistic.
2. Optimized tree can be selected out of several possible trees.
3. Its demerit is underestimation of very long evolutionary distances, because it ignores
homoplasies (absence of a character in the common ancestor, though it is being shared
by a group of related species originating from the common ancestor).
4. It ignores the role of intermediate ancestor(s); hence, consistency of evolution is not the
basic assumption.
5. Outgroup is added to the sequences in order to generate a rooted tree using the FM
method.
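
For reference, the Jukes-Cantor distance mentioned above is d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. A minimal sketch, assuming pre-aligned sequences of equal length:

```python
import math

def jukes_cantor(seq1, seq2):
    # p is the raw proportion of mismatched sites; the JC69 formula
    # corrects it upward to account for multiple hits at the same site.
    assert len(seq1) == len(seq2)
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -0.75 * math.log(1 - 4 * p / 3)

# Two toy aligned sequences differing at 2 of 10 sites (p = 0.2):
print(round(jukes_cantor("ACGTACGTAC", "ACGTACGTTT"), 4))  # 0.2326
```

Note that the corrected distance (about 0.23) exceeds the raw mismatch proportion (0.20), reflecting unseen multiple substitutions; the formula diverges as p approaches 0.75, which is one reason very long distances are underestimated or unusable.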

12. Describe progressive method of multiple sequence alignment.


Answer: Progressive alignment depends on the stepwise assembly of multiple alignment and
is heuristic in nature. It speeds up the alignment of multiple sequences through a multistep
process. It first conducts pairwise alignments for each possible pair of sequences using the
Needleman–Wunsch global alignment method and records these similarity scores from the
pairwise comparisons. The scores can either be percent identity or similarity scores based on a
particular substitution matrix. Both scores correlate with the evolutionary distances between
sequences. The scores are then converted into evolutionary distances to generate a distance
matrix for all the sequences involved. A simple phylogenetic analysis is then performed based
on the distance matrix to group sequences based on pairwise distance scores. As a result, a
phylogenetic tree is generated using the neighbor-joining method. The tree reflects
evolutionary proximity among all the sequences.

It needs to be emphasized that the resulting tree is an approximate tree and does not have the
rigor of a formally constructed phylogenetic tree. Nonetheless, the tree can be used as a guide
for directing realignment of the sequences. For that reason, it is often referred to as a guide
tree. According to the guide tree, the two most closely related sequences are first re-aligned
using the Needleman–Wunsch algorithm. To align additional sequences, the two already
aligned sequences are converted to a consensus sequence with gap positions fixed. The
consensus is then treated as a single sequence in the subsequent step.
In the next step, the next closest sequence based on the guide tree is aligned with the consensus
sequence using dynamic programming. More distant sequences or sequence profiles are
subsequently added one at a time in accordance with their relative positions on the guide tree.
After realignment with a new sequence using dynamic programming, a new consensus is
derived, which is then used for the next round of alignment. The process is repeated until all
the sequences are aligned as shown below.

Figure. Schematic of a typical progressive alignment procedure (e.g., Clustal). Angled wavy lines represent
consensus sequences for sequence pairs A/B and C/D. Curved wavy lines represent a consensus for A/B/C/D.
Probably the most well-known progressive alignment program is Clustal. Some of its important
features are introduced next. Clustal (www.ebi.ac.uk/clustalw/) is a progressive multiple
alignment program available either as a stand-alone or on-line program. The stand-alone
program, which runs on UNIX and Macintosh, has two variants, ClustalW and ClustalX. The
W version provides a simple text-based interface, and the X version provides a more user-
friendly graphical interface.

Steps:
1. Perform pairwise alignments of all sequences and record their similarity scores.
2. Create a distance matrix from the pairwise scores.
3. Create a phylogenetic “guide tree” from the matrix, placing the sequences at
the terminal nodes.
4. Align the two most similar sequences first.
5. Use the guide tree to determine the next sequence (or profile) to be added to the
alignment, preserving existing gaps.
6. Repeat step 5 until all sequences are aligned.

Progressive MSA is one of the fastest approaches, considerably faster than extending
pairwise dynamic-programming alignment directly to multiple sequences, which becomes a
very slow process for more than a few sequences.

One major disadvantage, however, is the reliance on a good alignment of the first two
sequences. Errors there can propagate throughout the rest of the MSA. An alternative approach
is iterative MSA (see below).

Software used for MSA


 Clustal-W - the famous Clustal-W multiple alignment program
 Clustal-X - provides a window-based user interface to the Clustal-W multiple
alignment program
 DCSE - a multiple alignment editor
 Jalview - a Java multiple alignment editor
 Musca - multiple sequence alignment of amino acid or nucleotide sequences; uses
pattern discovery
 MUSCLE - more accurate than T-Coffee, faster than Clustal-W.

13. Explain various methods of phylogenetic tree construction.


Answer: Phylogenetics is the study of genetic relatedness of individuals of the same, or
different, species. Through phylogenetics, evolutionary relationships can be inferred. A
phylogenetic tree may be rooted or unrooted, depending on whether the ancestral root is known
or unknown, respectively. A phylogenetic tree’s root is the origin of evolution of the
individuals studied. Branches between leaves show the evolutionary relationships between
sequences, individuals, or species, and branch length represents evolutionary time. When
constructing and analyzing phylogenetic trees, it is important to remember that the resulting
tree is simply an estimate and is unlikely to represent the true evolutionary tree of life. Various
methods can be used to construct a phylogenetic tree. The two most used and most robust
approaches are maximum likelihood and Bayesian methods.

Distance-matrix methods
Distance-matrix methods are rapid approaches that measure the genetic distances
between sequences. After having aligned the sequences through multiple sequence alignment,
the proportion of mismatched positions is calculated. From this, a matrix is constructed that
describes the genetic distance between each sequence pair. In the resulting phylogenetic tree,
closely related sequences are found under the same interior node, and the branch lengths
represent the observed genetic distances between sequences.

Neighbor joining (NJ), and Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
are two distance-matrix methods. NJ and UPGMA produce unrooted and rooted trees,
respectively. NJ is a bottom-up clustering algorithm. The main advantages of NJ are its rapid
speed, regardless of dataset size, and that it does not assume an equal rate of evolution amongst
all lineages. Despite this, NJ only generates one phylogenetic tree, even when there are several
possibilities. UPGMA is an unweighted method, meaning that each genetic distance
contributes equally to the overall average. UPGMA is not a particularly popular method as it
makes various assumptions, including the assumption that the evolutionary rate is equal for all
lineages.

Maximum parsimony
Maximum parsimony attempts to reduce branch length by minimizing the number of
evolutionary changes required between sequences. The optimal tree would be the shortest tree
with the fewest mutations. All potential trees are evaluated, and the tree with the least amount
of homoplasy, or convergent evolution, is selected as the most likely tree. Since the most-
parsimonious tree is always the shortest tree, it may not necessarily best represent the
evolutionary changes that have occurred. Also, maximum parsimony is not statistically
consistent, leading to issues when drawing conclusions.

Maximum likelihood
Despite being slow and computationally expensive, maximum likelihood is the most widely
used phylogenetic method in research papers, and it is ideal for phylogeny construction from
sequence data. For each nucleotide position in a sequence, the maximum likelihood algorithm
estimates the probability of that position being a particular nucleotide, based on whether the
ancestral sequences possessed that specific nucleotide. The site probabilities are calculated
for both branches of a bifurcating tree, and the likelihood of the whole tree is obtained by
combining the probabilities over all sites (in practice, by summing log-likelihoods). Maximum
likelihood is based on the concept that each nucleotide site evolves independently, enabling
phylogenetic relationships to be analyzed at each site. The maximum likelihood method can be
carried out in a reasonable amount of time for four sequences. If more than four sequences are
to be analyzed, then basic trees are constructed for the initial four sequences, and further
sequences are subsequently added, and maximum likelihood is recalculated. Bias can be
introduced into the calculation as the order in which the sequences are added, and the initial
sequence used, play pivotal roles in the outcome of the tree. Bias can be avoided by repeating
the entire process multiple times at random so that the majority rule consensus tree can be
selected.

Bayesian inference
Bayesian inference methods infer phylogeny by using posterior probabilities of phylogenetic
trees. A posterior probability is generated for each tree by combining its prior probability with
the likelihood of the data. A phylogeny is best represented by the tree with the highest posterior
probability. Not only does Bayesian inference produce results that can be easily interpreted; it
can also incorporate prior information and complex models of evolution into the analysis, as
well as accounting for phylogenetic uncertainty.
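
A toy numerical illustration of the posterior calculation (posterior is proportional to likelihood times prior); the likelihood values and the flat prior are invented purely for demonstration:

```python
# Three candidate topologies with assumed data likelihoods.
likelihoods = {"tree1": 1e-12, "tree2": 5e-12, "tree3": 4e-12}
prior = 1 / len(likelihoods)  # flat prior over topologies

unnormalized = {t: lk * prior for t, lk in likelihoods.items()}
z = sum(unnormalized.values())
posterior = {t: v / z for t, v in unnormalized.items()}
print(posterior)  # tree2 is best supported, posterior 0.5
```

In practice the space of trees is far too large to enumerate, so Bayesian phylogenetics programs sample trees with Markov chain Monte Carlo rather than computing posteriors exhaustively.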

14. Produce a phylogenetic tree using UPGMA for the given matrix.
Unit 5

Protein identification based on composition, Physical properties based on sequence, Motif and
pattern, Secondary structure (Statistical method: Chou Fasman and GOR method, Neural
Network and Nearest neighbor method) and folding classes, specialized structure or features,
Tertiary structures (Homology Modeling); Structure visualization methods (RASMOL,
CHIME etc.); Protein Structure alignment and analysis. Application of bioinformatics in drug
discovery and drug designing.

1. Write a short note on protein structure prediction methods.


Answer: Protein structure prediction is the prediction of the three-dimensional structure of a
protein from its amino acid sequence — that is, the prediction of its folding and its secondary,
tertiary, and quaternary structure from its primary structure. Structure prediction is
fundamentally different from the inverse problem of protein design. Protein structure
prediction is one of the most important goals pursued by bioinformatics and theoretical
chemistry; it is highly important in medicine (for example, in drug design) and biotechnology
(for example, in the design of novel enzymes).

There are three major theoretical methods for predicting the structure of proteins: comparative
modelling, fold recognition, and ab initio prediction.

Comparative modelling
Comparative modelling exploits the fact that evolutionarily related proteins with similar
sequences, as measured by the percentage of identical residues at each position based on an
optimal structural superposition, have similar structures. The similarity of structures is very
high in the so-called "core regions", which typically comprise a framework of
secondary structure elements such as alpha-helices and beta-sheets. Loop regions connect these
secondary structures and generally vary even in pairs of homologous structures with a high
degree of sequence similarity.

Fold recognition or "threading"


Threading uses a database of known three-dimensional structures to match sequences without
known structure with protein folds. This is accomplished by the aid of a scoring function that
assesses the fit of a sequence to a given fold. These functions are usually derived from a
database of known structures and generally include a pairwise atom contact and solvation
terms. Threading methods compare a target sequence against a library of structural templates,
producing a list of scores. The scores are then ranked and the fold with the best score is assumed
to be the one adopted by the sequence. The methods to fit a sequence against a library of folds
can be extremely elaborate computationally, such as those involving double dynamic
programming, dynamic programming with frozen approximation, Gibbs sampling using a
database of "threading" cores, and branch-and-bound heuristics, or as "simple" as using
sophisticated sequence alignment methods such as Hidden Markov Models.

Ab initio prediction
The ab initio approach is a mixture of science and engineering. The science is in understanding
how the three-dimensional structure of proteins is attained. The engineering portion is in
deducing the three-dimensional structure given the sequence. The biggest challenge in the
folding problem lies in ab initio prediction, which can be broken down into two components:
devising a scoring function that can distinguish correct (native or native-like) structures
from incorrect (non-native) ones, and a search method to explore the conformational space.
such that a search function drives, and is driven by, the scoring function to find native-like
structures.

Currently there does not exist a reliable and general scoring function that can always drive a
search to a native fold, and there is no reliable and general search method that can sample the
conformational space adequately to guarantee a significant fraction of near-native structures
(< 3.0 Å RMSD from the experimental structure).

Some methods for ab initio prediction include: Molecular Dynamics (MD) simulations of
proteins and protein-substrate complexes, which provide a detailed and dynamic picture of the
inter-atomic interactions underlying protein structure and function; Monte Carlo (MC)
simulations, which do not use forces but rather compare energies via Boltzmann
probabilities; Genetic Algorithms, which try to improve the sampling and convergence of MC
approaches; and exhaustive or semi-exhaustive lattice-based studies, which use a
crude/approximate fold representation (such as two residues per lattice point) and then
explore all or large parts of conformational space given that representation.

2. Explain Structure visualization methods for protein structure.


Answer: The protein structure can be visualised using various standalone and web applications,
including those given below:

 Chimera - UCSF Chimera is a highly extensible program for interactive visualization
and analysis of molecular structures and related data, including density maps,
supramolecular assemblies, sequence alignments, docking results, trajectories, and
conformational ensembles.
 DeepView Swiss-PDB Viewer - a user-friendly molecular visualization program. It
also includes various building and modeling tools.
 Discovery Studio Visualizer - DS Visualizer is a cut-down version of the commercial
DS viewer/editor program sold by Accelrys, which is free and very user-friendly.
 Jmol - an open-source molecular viewer written in Java. It can be used as an applet
which displays the molecule within a web browser or as a standalone application.
 MGLTools - software for visualization and analysis of molecular structures. It includes
three main applications: ADT, a graphical front-end for setting up and running
AutoDock; PMV, a molecular viewer; and Vision, a visual-programming environment.
 PyMOL - an open-source molecular graphics system with an embedded Python
interpreter designed for real-time visualization and rapid generation of high-quality
molecular graphics images and animations.
 VMD (Visual Molecular Dynamics) - a free molecular visualization program for
displaying, animating, and analyzing large biomolecular systems using 3-D graphics
and built-in scripting.
 Cn3D - an open-source visualization tool for biomolecular structures, sequences, and
sequence alignments. Cn3D is typically run from a web browser as a helper application
for NCBI's Entrez system.
 Friend - a bioinformatics application designed for simultaneous analysis and
visualization of multiple structures and sequences of proteins and/or DNA/RNA.

3. Write down application of bioinformatics in drug discovery and drug designing.


Answer: Bioinformatics is the analysis of biological data using computer programming,
mathematics, and statistics. One of its most important achievements is the Human Genome
Project, whose draft sequence was published in 2001. Much of the work in this field is
research-driven: scientists build databases and develop tools to analyze gene expression.
Because the biological data of an organism come in raw form, and determining gene function
from them is very time consuming, bioinformatics has successfully extracted information
about gene function that would have been impossible to obtain by other means. The field also
has great influence on biotechnology, the environment, medicine, agriculture, and human
health.

Bioinformatics has made it possible to sequence the genomes of various organisms, and
almost a hundred organisms' genomes have been mapped so far. The databases for these
organisms grow in size day by day as new information is added. As other fields of science
have made use of bioinformatics, the pharmaceutical industry has likewise adopted
bioinformatics and genomics for drug targeting and drug discovery.

The understanding of molecular biology has made it possible to design and develop drugs.
In the beginning, whole animals or organ preparations were used to test synthetic organic
molecules. In recent years, bioinformatics has made it easy for researchers to target
molecules in an in vitro environment. The screening of newly developed compounds can now
be done against protein targets or genetically modified cells, which gives very efficient
results. This way of drug development has eased the identification of disease targets in an
organism.
Today the pharmaceutical industries are capable of developing drugs that target about 500
gene products. A gene product is the biochemical material, either RNA or protein, produced
from gene expression. Scientists have now mapped the whole human genome, which consists
of roughly 30,000 to 40,000 genes, allowing them to discover more drugs by observing gene
expression. Mapping of the whole human genome has given scientists more drug targets, but
it is still a big challenge to find targets that will show successful results. It is the science of
bioinformatics that focuses the attention of scientists on target validation rather than target
identification alone.

There are many factors that should be considered during drug targeting, such as nucleotide
and protein sequences, functional predictions, mapping information, and gene and protein
expression data. Bioinformatics tools help in collecting information about all these factors
and accumulate it in databases. These databases save researchers' time, money, and effort by
organizing the information into groups and subgroups; they also provide knowledge of the
molecules involved.

There are two kinds of drugs that pharmaceutical industries develop: broad-spectrum and
narrow-spectrum drugs. Broad-spectrum drugs can be used as antibiotics to kill pathogenic
bacteria, ideally without harming human health. Bioinformatics contributes to drug targeting
through DNA and protein chips, which are very helpful in determining the expression of
proteins or DNA in cells.

The clustering algorithms in bioinformatics tools help researchers compare gene expression
data and distinguish diseased cells from healthy ones. These algorithms also enable
researchers to observe the function of a gene or protein during the disease process.

Bioinformatics is a very helpful field: it provides tools that supply detailed information about
proteins and gene families, which researchers can use for various applications, for example
drug targeting and drug discovery.

4. How can we predict the secondary structure of proteins?


Answer:
Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the
local secondary structures of proteins based only on knowledge of their amino acid sequence.
For proteins, a prediction consists of assigning regions of the amino acid sequence as
likely alpha helices, beta strands (often noted as "extended" conformations) or turns. The
success of a prediction is determined by comparing it to the results of the DSSP algorithm (or
similar e.g., STRIDE) applied to the crystal structure of the protein. Specialized algorithms
have been developed for the detection of specific well-defined patterns such as transmembrane
helices and coiled coils in proteins.
The best modern methods of secondary structure prediction in proteins were claimed to reach
80% accuracy by using machine learning and sequence alignments; this high accuracy allows
the use of the predictions as features improving fold recognition and ab initio protein structure
prediction, classification of structural motifs, and refinement of sequence alignments. The
accuracy of current protein secondary structure prediction methods is assessed in
weekly benchmarks such as LiveBench and EVA.
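
As an illustration of the statistical (Chou-Fasman-style) approach named in the syllabus, the sketch below implements only the helix nucleation rule: a window of six residues nucleates a helix if at least four are helix formers (propensity above 1.0). The propensity values are taken from the commonly quoted Chou-Fasman table but should be checked against the published source, and the full method's extension and termination steps are omitted:

```python
# Commonly quoted Chou-Fasman helix propensities P(alpha); any residue
# letter not listed defaults to 1.0 and is treated as a non-former here.
P_ALPHA = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
           "F": 1.13, "Q": 1.11, "I": 1.08, "W": 1.08, "V": 1.06,
           "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
           "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57}

def helix_nuclei(seq, window=6, formers_needed=4):
    # Slide a window along the sequence and flag potential helix nuclei.
    nuclei = []
    for i in range(len(seq) - window + 1):
        win = seq[i:i + window]
        formers = sum(P_ALPHA.get(res, 1.0) > 1.0 for res in win)
        if formers >= formers_needed:
            nuclei.append((i, win))
    return nuclei

print(helix_nuclei("MAELKTSGPAAK"))  # hypothetical peptide
```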
Tertiary structure: The practical role of protein structure prediction is now more important
than ever. Massive amounts of protein sequence data are produced by modern large-
scale DNA sequencing efforts such as the Human Genome Project. Despite community-wide
efforts in structural genomics, the output of experimentally determined protein structures—
typically by time-consuming and relatively expensive X-ray crystallography or NMR
spectroscopy—is lagging far behind the output of protein sequences.
Protein structure prediction remains an extremely difficult and unresolved undertaking.
The two main problems are the calculation of protein free energy and finding the global
minimum of this energy. A protein structure prediction method must explore the space of
possible protein structures which is astronomically large. These problems can be partially
bypassed in "comparative" or homology modelling and fold recognition methods, in which the
search space is pruned by the assumption that the protein in question adopts a structure that is
close to the experimentally determined structure of another homologous protein. On the other
hand, the de novo protein structure prediction methods must explicitly resolve these problems.
The progress and challenges in protein structure prediction have been reviewed by Zhang.

5. Explain homology modelling for tertiary structure prediction.


Answer: Homology modeling is one of the computational structure prediction methods that are used
to determine protein 3D structure from its amino acid sequence. It is considered to be the most accurate
of the computational structure prediction methods. It consists of multiple steps that are straightforward
and easy to apply. There are many tools and servers that are used for homology modeling. There is no
single modeling program or server which is superior in every aspect to others. Since the functionality
of the model depends on the quality of the generated protein 3D structure, maximizing the quality of
homology modeling is crucial. Homology modeling has many applications in the drug discovery
process. Since drugs interact with receptors that consist mainly of proteins, protein 3D structure
determination, and thus homology modeling, is important in drug discovery. Accordingly, protein interactions have been clarified using 3D structures of proteins built by homology modeling. This contributes to the identification of novel drug candidates. Homology modeling plays an
important role in making drug discovery faster, easier, cheaper, and more practical. As new modeling
methods and combinations are introduced, the scope of its applications widens.

Homology modeling (comparative modeling) is one of the computational structure prediction methods
that are used to determine 3D structure of a protein from its amino acid sequence based on its template.
The basis for homology modeling is two major observations. First, a protein's 3D structure is principally determined by its amino acid sequence. Second, structure is more conserved than sequence and changes at a much slower rate during evolution. As a result, similar sequences fold into essentially identical structures, and even distantly related sequences adopt similar structures.
Homology modeling is the most accurate of the computational structure prediction methods [7]. 3D
structure predictions made by computational methods like de novo prediction and threading were
compared to homology modeling using Root Mean Square Deviation (RMSD) as a criterion. Homology
modeling was found to give 3D structures with the highest accuracy. Furthermore, it is a protein 3D
structure prediction method that needs less time and lower cost with clear steps. Thus, homology
modeling is widely used for the generation of 3D structures of proteins with high quality. This has
changed the ways of docking and virtual screening methods that are based on structure in the drug
discovery process.

Steps of Homology Modeling


Homology modeling is a structure prediction method that consists of multiple steps. Homology
modeling has common standard procedures with minor differences. The standard steps of homology
modeling are summarized in Figure 1 and a detailed explanation is given below the figure.

6. Draw a decision-making chart for protein structure prediction.


Answer:
A flow chart for the proposed protein structure prediction method is shown in Fig. 1. It consists of four main steps: (a) preconditioning; (b) construction of an initial conformation with the all-atom model of a protein; (c) energy minimization and evaluation of the static energy of the protein 3D structure with the all-atom force field parameters; and (d) generation of evolutionarily related structures with the improved PSO algorithm, random perturbation and fragment substitution.

Fig. 1 Flow chart of the protein structure prediction method.

7. What is the best protein structure prediction method according to you? Justify
your answer in comparison to other methods.
Answer: AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D
structure from its amino acid sequence. It regularly achieves accuracy competitive with
experiment.

DeepMind and EMBL's European Bioinformatics Institute (EMBL-EBI) have partnered to create AlphaFold DB to make these predictions freely available to the scientific community.
The latest database release contains over 200 million entries, providing broad coverage
of UniProt (the standard repository of protein sequences and annotations). We provide
individual downloads for the human proteome and for the proteomes of 47 other key organisms
important in research and global health. We also provide a download for the manually curated
subset of UniProt (Swiss-Prot).
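As a hedged, illustrative sketch (the file-naming pattern and the "_v4" version suffix reflect AlphaFold DB at the time of writing and may change), a single predicted structure can be fetched by UniProt accession:

```python
# Sketch: downloading one AlphaFold DB prediction by UniProt accession.
# The URL pattern and version suffix are assumptions that may change over time.
import urllib.request

uniprot_id = "P69905"  # e.g. human haemoglobin subunit alpha
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"
urllib.request.urlretrieve(url, f"{uniprot_id}_alphafold.pdb")
print(f"Saved AlphaFold model for {uniprot_id}")
```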

AlphaFold was the top-ranked protein structure prediction method by a large margin, producing
predictions with high accuracy. While the system still has some limitations, the CASP results
suggest AlphaFold has immediate potential to help us understand the structure of proteins and
advance biological research.
AF2 starts by employing multiple sequence alignments (MSA) with different regions weighted
by importance (Attention). It then uses the Evoformer module to extract information about
interrelationships between protein sequences and template structures. The structure module
treats the protein as a residue gas moved around by the network to generate the protein’s 3D
structure followed by local refinement to provide the final prediction. Assessing the quality of
predictions using the TM-score (a structural similarity metric with values [0,1] whose random
value is 0.3 and values greater than 0.4 indicate fold similarity), AF2 often produced models
with a TM-score greater than 0.9. Going beyond a TM-score of 0.85 indicates that both the
global fold and details are correct. In contrast, MSA-based methods generate an average fold
and plateau around a TM-score of 0.85. The key to why AF2 works is the fact that the library of single-domain protein structures is essentially complete.
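To make the TM-score concrete, here is a minimal sketch of the formula itself, assuming the model and native structures are already aligned and superposed (real TM-score programs also search over superpositions):

```python
# Sketch: TM-score from per-residue C-alpha distances of a superposed pair.
import math

def tm_score(distances, l_target):
    """distances: model-to-native C-alpha distances (angstroms) for aligned
    residues; l_target: number of residues in the native (target) structure."""
    d0 = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8  # length-dependent scale
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target

# Toy example: a 150-residue target, mostly sub-2-angstrom deviations
print(round(tm_score([1.0] * 140 + [4.0] * 10, 150), 3))  # ~0.93: fold and details correct
```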

8. What are the major steps involved in the drug discovery process following identification of the biological target? Explain.
Answer: The complexity of drug development has increased manifold over the past 40 years, requiring a preclinical phase of drug development, an investigational new drug (IND) application, and complete clinical testing before marketing approval from the FDA. Generally, new drug applications (NDAs) or biologics license applications (BLAs) are reviewed comprehensively before approval, and drug performance data are then resubmitted to regulatory agencies in post-marketing studies. The overarching goal is to bring more efficient and safer treatments to patients as quickly as possible after a thorough medical evaluation.
Phases And Stages
There are five critical steps in the U.S. FDA drug development process, including many phases
and stages within each of them. We will discuss each phase and different stages of drug
development to develop an in-depth understanding of the entire process. The phases of drug
development are –
Step 1: Discovery and Development
Step 2: Preclinical Research
Step 3: Clinical Development
Step 4: FDA Review
Step 5: FDA Post-Market Safety Monitoring.
Step 1: Discovery & Development

Drug discovery research is how new medications are discovered. Historically, drug discovery,
design, and development mostly started with identifying active ingredients from traditional
medicines or purely by chance. Later, classical pharmacology was used to investigate chemical
libraries including small molecules, natural products, or plant extracts, and find those with
therapeutic effects. Since the human genome was sequenced, reverse pharmacology has found remedies for existing diseases through modern testing. Disease processes, molecular compound tests, existing treatments with unanticipated effects, and new technologies all spur the drug discovery timeline. Today the steps in drug discovery and development involve
screening hits, iterative medicinal chemistry, and optimization of hits to reduce potential drug
side effects (increasing affinity and selectivity). Efficacy or potency, metabolic stability (half-
life), and oral bioavailability are also improved in these steps of the drug development process.

Target Identification and Validation


Target identification finds a gene or protein (the therapeutic target) that plays a significant role in disease. Afterward, scientists and researchers record the target's therapeutic characteristics.
Drug targets must be efficacious, safe, usable, and capable of meeting clinical and commercial
requirements. To validate targets, researchers use modern tools and techniques such as disease
association, bioactive molecules, cell-based models, protein interactions, signaling pathways
analysis, functional analysis of genes, in vitro genetic manipulation, antibodies, and chemical
genomics. For example, the Sanger Whole Genome CRISPR library and Duolink PLA are
excellent sources for drug discovery targets.

Hit Discovery Process


Following target validation, compound screening assays are developed.

Assay Development and Screening


Assay development in drug discovery is a crucial component of drug discovery workflow.
Assays are test systems that evaluate the effects of the new drug candidate at the cellular,
molecular, and biochemical levels.

High Throughput Screening


High Throughput Screening (HTS) uses robotics, data processing/control software, liquid
handling devices, and sensitive detectors to rapidly conduct millions of pharmacological,
chemical, and genetic tests, eliminating hours of painstaking testing by scientists. HTS
identifies active compounds, genes, or antibodies that affect human molecules.

Hit To Lead
In the Hit to Lead (H2L) process, small molecule hits from an HTS are evaluated and optimized
in a limited way into lead compounds. These compounds then move on to the lead optimization
process.

Lead Optimization
In the lead optimization (LO) process, the lead compounds discovered in the H2L process are
synthesized and modified to improve potency and reduce side effects. Lead optimization
conducts experimental testing using animal efficacy models and ADMET tools, designing the
drug candidate.

Active Pharmaceutical Ingredients


Active pharmaceutical ingredients (APIs) are biologically active ingredients in a drug
candidate that produce desired effects. All drugs are made up of the API or APIs and excipients.
Excipients are inactive substances that deliver the drug into the human system. High Potency
Active Pharmaceutical Ingredients (HP APIs) are molecules that are effective at much smaller
dosage levels than standard APIs. They are classified based on toxicity, pharmacological
potency, and occupational exposure limits (OELs), and used in complex drug development
involving more than ten steps.
The drug discovery process gets narrowed when one lead compound is found for a drug
candidate, and the process of drug development starts.
Step 2: Preclinical Research
Once a lead compound is found, preclinical phase of drug development begins with in vivo
research to determine the efficacy and safety of the drug. Researchers determine the following
about the drug:
 Absorption, distribution, metabolization, and excretion information
 Potential benefits and mechanisms of action
 Best dosage, and administration route
 Side effects/adverse events
 Effects on gender, race, or ethnicity groups
 Interaction with other treatments
 Effectiveness compared to similar drugs
Preclinical Trials test the new drug on non-human subjects for efficacy, toxicity, and
pharmacokinetic (PK) information. These trials are conducted by scientists in vitro and in vivo
with unrestricted dosages.

Absorption, Distribution, Disposition, Metabolism, & Excretion


Absorption, Distribution, Disposition, Metabolism, & Excretion (ADME) is a pharmacokinetic (PK) framework for measuring how the body acts on the new drug: how it is absorbed, distributed, metabolized, and excreted. ADME involves mathematical descriptions of each of these processes.

Proof of Principle / Proof of Concept


Proof of Principle (PoP) studies are those that succeed in preclinical trials and early safety testing. Proof of Concept (PoC) terminology is used almost interchangeably with PoP in drug
discovery and development projects. Successful PoP/PoC studies lead to program advancement
to the Phase II studies of dosages.

In Vivo, In Vitro, And Ex Vivo Assays


These three types of studies are conducted on whole living organisms or cells, including animals and humans, or on material taken from them. In vivo preclinical research examples include the development of new drugs using mouse, rat, and dog models. In vitro research is conducted in a laboratory, outside a living organism. Ex vivo research uses cells or tissues taken from an animal. Examples of ex vivo research assays are finding effective cancer treatment agents; measurements of tissue properties (physical, thermal, electrical, and optical); and realistic modeling for new surgical procedures. In an ex vivo assay, cells are used as the basis for small explant cultures that provide a dynamic, controlled, and sterile environment.

In Silico Assays
In silico assays are test systems or biological experiments performed on a computer or via
computer simulation. These are expected to become increasingly popular with the ongoing
improvements in computational power, and behavioral understanding of molecular dynamics
and cell biology.

Drug Delivery
New drug delivery methods include oral, topical, membrane, intravenous, and inhalation. Drug
delivery systems are used for targeted delivery or controlled release of new drugs.
Physiological barriers in animal or human bodies may prevent drugs from reaching the targeted
area or releasing when they should. The goal is to prevent the drug from interacting with
healthy tissues while still being effective.
 Oral: Oral delivery of medications is reliable, cost-effective, and convenient for
patients. Oral drug delivery may not deliver precise dosages to the desired area but is
ideal for prophylactic vaccinations and nutritional regimens. Drawbacks include delayed
action, destruction by stomach enzymes, and inconsistent absorption, particularly in
patients with gastrointestinal issues; patients must also be conscious during administration.
 Topical: Topical drug delivery involves ointments, creams, lotions, or transdermal
patches that deliver a drug by absorption into the body. Topical delivery is more useful
for patient skin or muscular conditions — it is preferred by patients due to non-invasive
delivery and their ability to self-administer the medicine.
 Parenteral (IM, SC or IP Membrane): Parenteral drug delivery utilizes bodily
membranes, including intramuscular (IM), intraperitoneal (IP), or subcutaneous
(SC) routes. It is often used for unconscious patients and avoids epithelial barriers that are
difficult for drugs to cross.
 Parenteral (Intravenous): Intravenous injection is one of the fastest drug delivery
absorption methods. IV injection ensures entire doses of drugs enter the bloodstream,
and it is more effective than IM, SC, or IP membrane methods.
 Parenteral (Inhalation): Inhalation drug delivery gets the drug rapidly absorbed into
the mucosal lungs, nasal passages, throat, or mouth. Problems with inhalation delivery
include difficulty delivering the optimum dosage due to small mucosal surface areas
and patient discomfort. Pulmonary inhalation drug delivery uses fine drug powders or
macromolecular drug solutions. Lung fluids resemble blood, so they can absorb small
particles easily and deliver them into the bloodstream.

Formulation Optimization & Improving Bioavailability


Formulation optimization is ongoing throughout the preclinical and clinical stages. It ensures drugs are delivered to the proper place at the right time and in the right concentration. Optimization may include overcoming solubility limitations to improve bioavailability.

Step 3: Clinical Drug Development Process


Once preclinical research is complete, researchers move on to clinical drug development,
including clinical trials and volunteer studies to finetune the drug for human use.

Complexity Of Study Design, Associated Cost & Implementation Issues


The complexity of Clinical Trial design and its associated costs and implementation issues may
affect trials carried out during this phase. Trials must be safe and efficacious and be completed
under the drug development budget, using a methodology to ensure the drug works as well as
possible for its intended purpose. This rigorous process must be set up correctly and enroll
many volunteers to be effective.

Clinical Trials– Dose Escalation, Single Ascending & Multiple Dose Studies
Proper dosing determines medication effectiveness, and clinical trials examine dose escalation, single ascending, and multiple dose studies to determine the best patient dosage.
Phase I – Healthy Volunteer Study
This phase is the first time the drug is tested on humans; fewer than 100 volunteers help researchers assess the drug's safety and pharmacokinetics (absorption, metabolism, and elimination in the body), as well as any side effects, to establish safe dosage ranges.

Phase II And Phase III – Studies in Patient Population


Phase II assesses drug safety and efficacy in an additional 100-500 patients, who may receive
a placebo or standard drug previously used as treatment. Analysis of optimal dose strength
helps create schedules while adverse events and risks are recorded. Phase III enrolls 1,000-
5,000 patients, enabling medication labeling and instructions for proper drug use. Phase III
trials require extensive collaboration, organization, and Independent Ethics Committee (IEC)
or Institutional Review Board (IRB) coordination and regulation in anticipation of full-scale
production following drug approval.

Biological Samples Collection, Storage & Shipment


During clinical trials, biological samples are collected, stored, and shipped from testing sites
according to global standards and regulations. Transport containers of biological samples may
include dry ice packs or other temperature stabilizing methods. Different requirements apply
to different types of biological samples.

Pharmacodynamic (PD) Biomarkers


PD biomarkers are molecular indicators of the drug's effects on its target in humans, and link drug regimen to biological responses. These data can help select rational combinations of targeted agents and optimize drug regimens and schedules. Rationality and hypothesis-testing power are increased through the use of PD endpoints in human trials.

Pharmacokinetic Analysis
Pharmacokinetic analysis is an experimental study that determines how a new drug behaves in the human body. The volume of distribution, clearance, and terminal half-life are defined through compartmental modeling.
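As a small worked illustration (all values invented), a one-compartment IV-bolus model shows how clearance and terminal half-life fall out of compartmental modeling:

```python
# Sketch: one-compartment IV-bolus pharmacokinetics with illustrative numbers.
import math

dose = 100.0   # mg, IV bolus
v_d = 40.0     # L, volume of distribution
k_el = 0.173   # 1/h, first-order elimination rate constant

clearance = k_el * v_d          # CL = k * V  -> ~6.9 L/h
half_life = math.log(2) / k_el  # t1/2 = ln(2) / k -> ~4.0 h

def concentration(t):
    """Plasma concentration (mg/L) at time t hours: C(t) = (Dose/V) * exp(-k*t)."""
    return (dose / v_d) * math.exp(-k_el * t)

print(f"CL = {clearance:.1f} L/h, t1/2 = {half_life:.1f} h, C(4 h) = {concentration(4):.2f} mg/L")
```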

Bioanalytical Method Development and Validation


Bioanalytical methods detect analytes and metabolites such as drug or biomarkers in biological
or human samples to determine drug efficacy and safety. The complete bioanalytical
assay consists of sample collection, clean-up, analysis, and detection.

Drug (Analyte) & Metabolite Stability in Biological Samples


Stability is important in determining human drug efficacy, and biological samples are required.
Drug and drug metabolites are susceptible to degradation, which can lower drug concentration
over the life of the drug.

Blood, Plasma, Urine & Feces Sample Analysis for Drug and Metabolites
Biological samples used in clinical trials include blood, plasma, urine, and feces to determine
and analyze various properties and effects of the drug and its metabolites on humans.
Patient Protection – GCP, HIPAA, & Adverse Event Reporting
Human patients must always be protected during clinical trials; Good Clinical Practice (GCP), the Health Insurance Portability and Accountability Act (HIPAA), and adverse event reporting to the IEC/IRB regulate and ensure their safety.

Step 4: FDA Review


Once the new drug has been formulated for its best efficacy and safety, and the results from clinical trials are available, it is advanced for holistic FDA review. At this time, the FDA reviews and either approves or does not approve the drug application submitted by the drug development company.

Regulatory Approval Timeline


The new drug regulatory approval timeline may be standard, fast track, breakthrough, accelerated approval, or priority review depending on the drug's applications and its necessity for patients. With standard or priority review, the approval timeline may be up to a year. Fast track, breakthrough, or accelerated approvals may occur sooner.

IND Application
IND applications are submitted to the FDA before starting clinical trials. If clinical trials are
ready to be conducted, and the FDA has not responded negatively about the drug, developers
may start the trials.

NDA / ANDA / BLA Applications


An NDA, abbreviated new drug application (ANDA), or BLA is submitted to the FDA after clinical trials demonstrate drug safety and efficacy. The FDA reviews the study data and decides whether to grant approval. Additional research or an expert advisory panel may be required before a final decision is made.

Orphan Drug
An orphan drug is intended to treat a disease so rare that financial sponsors are unwilling to develop the drug under standard marketing conditions. These drugs may not be approved quickly or at all.

Accelerated Approval
New drugs may be granted accelerated approval if there is strong evidence of positive impact on a surrogate endpoint instead of evidence of impact on the actual clinical benefits the drug provides. Expedited approval means the medication can help treat severe or life-threatening conditions sooner.

Reasons For Drug Failure


New drug applications may fail for a variety of reasons, including toxicity, efficacy, PK properties, bioavailability, or inadequate drug performance.
 Toxicity: If the toxicity of a new drug is too high in human or animal patients, the drug
may be rejected due to safety concerns about its use following manufacture.
 Efficacy: If a new drug’s efficacy is not high enough or evidence is inconclusive, the
FDA may reject it.
 PK Properties or Bioavailability: PK properties or poor bioavailability due to low
aqueous solubility, or high first-pass metabolism, may also cause a drug to fail FDA
review. PK causes of drug failure include inadequate action duration and unanticipated
human drug interactions.
 Inadequate Drug Performance: If the new drug performs the desired function, but
only at a shallow level, the FDA may reject the application in favor of a formulation
that performs better.

Step 5: Post-Market Monitoring


Following drug approval and manufacturing, the FDA requires drug companies to monitor the
safety of its drug using the FDA Adverse Event Reporting System (FAERS) database. FAERS
helps FDA implement its post-marketing safety surveillance program. Through this program,
manufacturers, health professionals, and consumers report problems with approved drugs.


9. Define the tertiary structure of a protein. Discuss the method of homology modelling.


Answer: The tertiary structure of a protein is the overall three-dimensional arrangement of its entire polypeptide chain, stabilized by interactions among amino acid side chains such as hydrogen bonds, ionic interactions, hydrophobic packing, and disulfide bridges. The practical role of protein structure prediction is now more important than
ever. Massive amounts of protein sequence data are produced by modern large-
scale DNA sequencing efforts such as the Human Genome Project. Despite community-wide
efforts in structural genomics, the output of experimentally determined protein structures—
typically by time-consuming and relatively expensive X-ray crystallography or NMR
spectroscopy—is lagging far behind the output of protein sequences.
Protein structure prediction remains an extremely difficult and unresolved undertaking.
The two main problems are the calculation of protein free energy and finding the global
minimum of this energy. A protein structure prediction method must explore the space of
possible protein structures which is astronomically large. These problems can be partially
bypassed in "comparative" or homology modelling and fold recognition methods, in which the
search space is pruned by the assumption that the protein in question adopts a structure that is
close to the experimentally determined structure of another homologous protein. On the other
hand, the de novo protein structure prediction methods must explicitly resolve these problems.
The progress and challenges in protein structure prediction have been reviewed by Zhang.
Homology modelling is a procedure to predict the 3D structure of a protein. It relies on a few
principles:
 The structure of a protein is uniquely determined by its amino acid sequence
 Therefore, the sequence should, in theory, contain enough information to obtain the
structure
 Similar sequences have been found to adopt practically identical structures while
distantly related sequences can still fold into similar structures.
The predictive methods to adopt strongly depend on the percentage of sequence identity
between the protein of unknown structure (“target”) and a protein with known structure
(“template”).

Approach                         Target-template identity / homology

Comparative modeling             > 30% sequence identity
Threading / fold recognition     0 – 30% sequence identity
Ab initio / de novo              no detectable homologue

The percentage of sequence identity also affects the quality of the final model and, therefore, the studies you can carry out with the model (a simple decision helper based on these thresholds is sketched below).

Sequence identity                Model quality
60 – 100%                        Comparable with average-resolution NMR; suitable for substrate-specificity studies
30 – 60%                         Starting point for site-directed mutagenesis studies
< 30%                            Serious errors likely
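The thresholds in these tables translate directly into a simple decision helper; the sketch below is only an illustration of the rule of thumb, not a substitute for case-by-case judgement:

```python
# Sketch: choosing a structure-prediction strategy from target-template identity.
def choose_approach(percent_identity, template_found=True):
    if not template_found:
        return "ab initio / de novo prediction (no homologue available)"
    if percent_identity > 30:
        return "comparative (homology) modelling"
    return "threading / fold recognition (0-30% identity)"

print(choose_approach(45))                       # comparative (homology) modelling
print(choose_approach(18))                       # threading / fold recognition
print(choose_approach(0, template_found=False))  # ab initio / de novo
```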

HOMOLOGY MODELLING BASICALLY CONSISTS OF 8 STEPS


1. Template recognition and initial alignment
2. Alignment correction
3. Backbone generation
4. Loop modeling
5. Side chain modelling
6. Model optimisation
7. Model validation (by hand or using different servers)
8. Iteration to correct mistakes (if any)

STEP 1: TEMPLATE RECOGNITION AND INITIAL ALIGNMENT


 To identify the template, the program compares the query sequence to all the
sequences of known structures in the PDB (e.g. BLAST)
 Usually, the template structure with the highest sequence identity and coverage
will be the first option
 Other considerations:
 conformational state (i.e. active or inactive)
 present co-factors
 other molecules or multimeric complexes
 It is possible to choose multiple templates and build multiple models
 It is possible to combine multiple templates into one structure that is used for
modeling

STEP 2: ALIGNMENT CORRECTION


Having identified one or more possible modelling templates using the initial screen described
above, more sophisticated methods are needed to arrive at a better alignment

STEP 3: BACKBONE GENERATION


 When the alignment is ready, the actual model building can start
 Creating the backbone is trivial for most of the model: one simply transfers the
coordinates of those template residues that show up in the alignment with the model
 If two aligned residues differ, the backbone coordinates for N, Cα, C and O and often
also the Cβ can be copied
 Conserved residues can be copied completely to provide an initial guess

STEP 4: LOOP MODELING


 For the majority of homology model building cases, the alignment between model and
template sequence contains gaps
 Gaps in the model-sequence are addressed by omitting residues from the template
 Gaps in the template sequence are treated by inserting the missing residues into the
continuous backbone
 Changes in loop conformation are notoriously hard to predict
Loop modelling: a search is made through the PDB for known loops containing endpoints
that match the residues between which the loop is to be inserted.

STEP 5: SIDE CHAIN MODELING


 Libraries of common rotamers extracted from high resolution X-ray structures are often
used to position side chains
 The various rotamers are subsequently explored and scored with a variety of energy
functions

STEP 6: MODEL OPTIMISATION


 To predict the side chain rotamers with high accuracy, we need the correct backbone,
which in turn depends on the rotamers and their packing
 The common approach to address this problem is to iteratively model the rotamers and
backbone structure
 First, we predict the rotamers, then remodel the backbone to accommodate rotamers,
followed by a round of refitting the rotamers to the new backbone
 This process is repeated until the solution converges
 This boils down to a series of rotamer prediction and energy minimization steps

STEP 7: MODEL VALIDATION


 Every protein structure contains errors, and homology models are no exception
 The number of errors (for a given method) mainly depends on two values:
o The percentage sequence identity between template and model-sequence
o The number of errors in the template
 There are two principally different ways to estimate errors in a structure
o Calculating the model’s energy based on a force field
o Determining normality indices that describe how well a given characteristic of
the model resembles the same characteristic in real structures

STEP 8: ITERATION TO CORRECT MISTAKES


 When errors in the model are recognised and located, they can be corrected by iterating
portions of the homology modeling process.
 Small errors that are introduced during the optimisation step can be removed by running
a shorter molecular dynamics simulation
 An error in a loop can be corrected by choosing another loop conformation in the loop
modeling step
 Large mistakes in the backbone conformation sometimes require the complete process
to be repeated with another alignment or even with a different template
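In practice, steps 3-6 are usually automated by a modeling package. The sketch below follows the classic MODELLER "automodel" tutorial workflow; the file names and sequence/structure codes are placeholders, and the exact class names may differ between MODELLER versions, so treat this as an assumption-laden outline rather than a definitive recipe:

```python
# Sketch: automated homology modeling with MODELLER (requires a licensed
# MODELLER install and a PIR-format target-template alignment 'align.ali').
from modeller import Environ
from modeller.automodel import AutoModel

env = Environ()
model = AutoModel(env,
                  alnfile='align.ali',    # alignment from steps 1-2 (placeholder name)
                  knowns='template_pdb',  # code of the template structure
                  sequence='target_seq')  # code of the target sequence
model.starting_model = 1
model.ending_model = 5  # build five candidate models to compare and validate
model.make()            # runs backbone/loop/side-chain building and optimisation
```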

10. Explain the rational drug design approach.


Answer: Rational drug design refers to the development of medications based on the study of the structures and functions of target molecules. That is to say, rational drug design takes a methodological approach to devising a new drug, as opposed to hoping some stroke of luck produces one, or randomly testing hundreds of drug molecules in the hope that one of them binds to a receptor and exerts a therapeutic effect.
During rational drug design, researchers take three general steps to create a new drug:
Step 1. Identify a receptor or enzyme that is relevant to a disease they are going to design
a drug for.
Step 2. Elucidate the structure and function of this receptor or enzyme.
Step 3. Use the information from step two in order to design a drug molecule that
interacts with the receptor or enzyme in a therapeutically beneficial way.

This seems simple, but it must take many important factors into account. For example, researchers must account for whether the drug will bind to a specific molecule or to more than one; binding to other molecules might bring about more adverse effects. How well the drug binds to the target molecule also needs to be clarified: if it cannot bind the target well, it might be useless.

The shape and charge of the molecule must be examined in order to determine how this might
affect the way it not only binds to the target, but is absorbed, distributed, metabolized, and
excreted by the body.

The general process of rational drug design is as follows: scientists must first identify a target
for a certain therapeutic need. Next, the structure of the target protein must be defined. Lastly,
the drug’s structure must be designed so that it specifically interacts with the target protein. If
a biological target cannot be identified, an alternative route is taken. Here, we will discuss all
three major rational drug design approaches, including structure-based (as briefly described
above), pharmacophore-based (an alternative approach used when the biologic target cannot
be defined), and new lead generation.
Structure-based approaches
The first port of call in rational drug design is establishing the 3-dimensional structure of the known target (often a protein or a nucleic acid). Identifying this structure usually involves producing a working computational model from crystallographic data, although in recent years models of the active binding site built from active ligands have become more common. From this, scientists work on designing a drug that has the correct
structure required to specifically bind to the target site and induce the biological activity
required for therapeutic effect.

Pharmacophore-Based Approaches
In cases where scientists cannot identify a target, or they cannot determine the structure of a
target, an alternative strategy is used. This approach leverages data on drugs that produce the
same intended effect as the one that is being developed. Generally, it is assumed that drugs that
induce the same biological activity are acting on the same pathways and likely the same
biological target.

Therefore, it is also assumed that to trigger these same biological responses, the new drug must
have the same set of structural features to allow it to successfully interact with the target. This
method, based on the use of analogs to develop a model of the requirements to induce
therapeutic activity is known as the pharmacophore model.

In the pharmacophore model, scientists first determine the three-dimensional features that are
critical for biological activity. Next, the optimal combinations for the structural features are
calculated. Finally, scientists design a drug structure that encompasses this optimal
combination of features.

Pharmacophore approaches have benefited in recent years from advances in the fields of
molecular biology, protein crystallography, and computational chemistry, which have helped
to aid the accuracy of drug design binding affinity predictions.

New lead generation


Finally, de novo design methods can also be used to design new drug structures using databases
of known compounds with particular structural features. Drugs designed by this approach may
be created by sequentially adding or joining molecular fragments to a structure or by adding
functionality by evolving the structure of a molecular scaffold.
11. Discuss the In Silico Drug Design Methods
Answer: In silico drug design, sometimes referred to as computational medicine, is the application of in silico research to problems in health and medicine. "In silico" methodologies are computer simulations with which researchers can diagnose, treat, and prevent different diseases and ailments. By exploiting the exponential advances in computation and data processing, it is now both possible and practical to simulate real biological pathways in virtual environments. This approach is often geared towards ligand-based medicine and is a vital division of drug modeling and preparation. The bioactive ligands themselves have granted us a plethora of novel data regarding drug and protein targets.

The nature of In Silico Drug Design


After its first conception in 1989 at the “Cellular Automata Theory and Applications Workshop” at the
National Autonomous University of Mexico, it is now the front runner of many fields such as radiation
oncology, genetics, physiology, and biochemistry.

In the future, we will see a positive correlation between the exponential growth of technology and the practicality of in silico drug design, given that this process is grounded in computation and data storage. This first became apparent around 1990, when growing processing and data storage capacity made it feasible to sequence the entire human genome.

In silico drug design and modeling now pose a very attractive alternative to human and animal testing. It has even been hypothesized that these models could render testing on living organisms completely obsolete. To take it one step further, virtual patients may become conventional, replacing actual patients or cadavers within the medical school curriculum.

Complex modeling software (e.g., LEGEND, CrystalDock, GANDI) is used to analyze molecular interactions of interest, identifying potential binding fragments for future drugs. These virtual fragments are linked to previously known docking and protein binding sites (receptors). Using a combination of genetic algorithms and chemical/virtual screening, new drugs can be developed at a much faster rate.

How in Silico can be Used within Drug Design


Through In silico medicine, we can now obtain an earlier prediction of success regarding different
medicinal compounds, and we can garner a better understanding of the adverse side effects of drugs
earlier on in their discovery process.

The first case of vaccine design via genomic information was seen back in 2003, a technique christened reverse vaccinology. Using genomic information and sequencing from microbial genomes, potential antigens can be discovered. The elucidation of each antigen and each corresponding pathogen has led to a revitalization of vaccine development.

An example of this can be found in the work of Alessandro Sette's team of immunologists, who developed a vaccine against serogroup B meningococcus. The computational screening of vaccination targets revealed CD8+ and CD4+ T-cell epitopes, which led to a clearer understanding of pathogenic immunity towards vaccination and, correspondingly, a better understanding of the nature of T cells.
The process of Ligand Design Using in Silico Methodologies
These ligands are first prepared and modeled through computational means, acquiring previously
derived data to obtain a two-dimensional or three-dimensional structure of the parent molecule, and
then relating this data to other ligand interactions. This technique is performed by using prior models
of biological systems to predict the molecular dynamics in a medicinal context.

The next step in ligand design is target prediction. This process involves predicting drug targets through computational extrapolation tools. Through homology modeling and the digital assaying of
biochemical pathways, the active site and binding site of different drugs can often be found without the
use of any wet chemistry or lab workup. By circumventing this process, both medical and research
centers have saved both time and resources while on their trek towards drug discovery.

The third and final step in drug discovery, just before trials, is In-Silico bioactivity analysis. This is
accomplished through chemical screening, and or virtual screening.

Chemical screening is performed on the lab bench after digital assays narrow down a large number of
chemical compounds to be tested for a specified biological function.

In this form of assay, a large number of compounds is tested against biological targets such as channel proteins, hormone receptors, and others. The biological function of these ligands is often studied within test tubes.

For example, a fixed amount of target protein is kept in test tubes, within an appropriate buffer medium.
Once this is done, different compounds selected from a predetermined library are added to the test tubes
at a fixed concentration. Then the inhibitory activity of the compound is measured using an appropriate
test. Compounds showing inhibitory activity are taken as hits and are selected for the next round of
assay to determine their operating inhibitory activity, to find the most potent inhibitor. This is generally
done with the help of robotics.

In virtual screening, the binding affinity of compounds from a data library is calculated against an in silico biological target using molecular docking programs. The top-ranked results can later be confirmed through experimental procedures.

Within the field of biochemistry, it is often repeated that structure equates to function, and
understanding the structure of ligands, protein targets, and other molecular machines through in silico
methods can aid in the discovery of drug-resistant mechanisms. It will also evolve precision medicine
and can be used to understand the underlying causes of many inherited diseases.

Determining the allostery of these molecules can help us identify alternative drug targeting sites and
advance our foundations of drug design as a whole.

12. Explain the Chou-Fasman method for protein secondary structure analysis.
Answer:
The Chou-Fasman method was among the first secondary structure prediction algorithms developed
and relies predominantly on probability parameters determined from relative frequencies of each amino
acid's appearance in each type of secondary structure (Chou and Fasman, 1974). The original Chou-
Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce
poor results compared to modern methods, though the parameterization has been updated since it was
first published. The Chou-Fasman method is roughly 56-60% accurate in predicting secondary
structures. The method is based on analyses of the relative frequencies of each amino acid in alpha
helices, beta sheets, and turns based on known protein structures solved with X-ray crystallography.
From these frequencies a set of probability parameters were derived for the appearance of each amino
acid in each secondary structure type, and these parameters are used to predict the probability that a
given sequence of amino acids would form a helix, a beta strand, or a turn in a protein. As noted above, this accuracy is significantly lower than that of modern machine learning-based techniques.

The evolutionary conservation of secondary structures can be exploited by simultaneously assessing many homologous sequences in a multiple sequence alignment, calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern machine learning methods such as neural networks and support vector machines, these methods can achieve up to 80% overall accuracy in globular proteins.
The theoretical upper limit of accuracy is around 90%, partly due to idiosyncrasies in DSSP assignment
near the ends of secondary structures, where local conformations vary under native conditions but may
be forced to assume a single conformation in crystals due to packing constraints. Limitations are also
imposed by secondary structure prediction's inability to account for tertiary structure; for example, a
sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located
within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic
conformational changes related to the protein's function or environment can also alter local secondary
structure.

The Chou–Fasman method predicts helices and strands in a similar fashion, first searching linearly
through the sequence for a "nucleation" region of high helix or strand probability and then extending
the region until a subsequent four-residue window carries a probability of less than 1. As originally
described, four out of any six contiguous amino acids were sufficient to nucleate helix, and three out of
any contiguous five were sufficient for a sheet. The probability thresholds for helix and strand
nucleations are constant but not necessarily equal; originally 1.03 was set as the helix cutoff and 1.00
for the strand cutoff.

Turns are also evaluated in four-residue windows but are calculated using a multi-step procedure
because many turn regions contain amino acids that could also appear in helix or sheet regions. Four-
residue turns also have their own characteristic amino acids; proline and glycine are both common in
turns. A turn is predicted only if the turn probability is greater than the helix or sheet probabilities and a
probability value based on the positions of particular amino acids in the turn exceeds a predetermined
threshold. The turn probability p(t) is determined as:

p(t) = pt(j) × pt(j+1) × pt(j+2) × pt(j+3)
where j is the position of the amino acid in the four-residue window. If p(t) exceeds an arbitrary cutoff
value (originally 7.5e–3), the mean of the p(j)'s exceeds 1, and p(t) exceeds the alpha helix and beta
sheet probabilities for that window, then a turn is predicted. If the first two conditions are met but the
probability of a beta sheet p(b) exceeds p(t), then a sheet is predicted instead.
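A minimal sketch of this turn test follows; the per-residue turn probabilities in the table are placeholders, not the published Chou-Fasman parameters:

```python
# Sketch: Chou-Fasman-style turn test over a four-residue window.
# p(t) is the product of per-position turn probabilities p_t (placeholder values).
P_TURN = {'P': 0.102, 'G': 0.156, 'A': 0.060, 'S': 0.120}  # invented numbers

def turn_probability(window):
    """window: four one-letter residue codes at positions j .. j+3."""
    p = 1.0
    for aa in window:
        p *= P_TURN[aa]
    return p

p_t = turn_probability("GPSG")
print(p_t, p_t > 7.5e-3)  # compare against the cutoff quoted above
```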
13. Describe GOR statistical method for secondary structure prediction.
Answer: The GOR method (short for Garnier–Osguthorpe–Robson) is an information theory-based
method for the prediction of secondary structures in proteins. It was developed in the late 1970s shortly
after the simpler Chou–Fasman method. Like Chou–Fasman, the GOR method is based
on probability parameters derived from empirical studies of known protein tertiary structures solved
by X-ray crystallography. However, unlike Chou–Fasman, the GOR method takes into account not only
the propensities of individual amino acids to form particular secondary structures, but also
the conditional probability of the amino acid to form a secondary structure given that its immediate
neighbors have already formed that structure. The method is therefore essentially Bayesian in its
analysis.

Method: The GOR method analyzes sequences to predict alpha helix, beta sheet, turn, or random
coil secondary structure at each position based on 17-amino-acid sequence windows. The original
description of the method included four scoring matrices of size 17×20, where the columns correspond
to the log-odds score, which reflects the probability of finding a given amino acid at each position in
the 17-residue sequence. The four matrices reflect the probabilities of the central, ninth amino acid
being in a helical, sheet, turn, or coil conformation. In subsequent revisions to the method, the turn
matrix was eliminated due to the high variability of sequences in turn regions (particularly over such a large window). The method required at least four contiguous residues scoring as alpha helix to classify a region as helical, and at least two contiguous residues for a beta sheet.
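The window-scoring idea can be sketched as follows; the matrices here are filled with random placeholders rather than the published GOR parameters, so only the mechanics (sum log-odds over a 17-residue window, pick the best-scoring state) are meaningful:

```python
# Sketch of GOR-style scoring: sum window contributions per state, take the max.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = ["helix", "sheet", "coil"]
random.seed(0)
# matrices[state][window position 0..16][amino acid] -> log-odds (placeholders)
matrices = {s: [{aa: random.uniform(-1, 1) for aa in AMINO_ACIDS}
                for _ in range(17)] for s in STATES}

def predict_state(seq, i):
    """Predict the state of residue i from its 17-residue window (offsets -8..+8)."""
    scores = {}
    for state in STATES:
        total = 0.0
        for offset in range(-8, 9):
            j = i + offset
            if 0 <= j < len(seq):  # positions off the chain ends contribute nothing
                total += matrices[state][offset + 8][seq[j]]
        scores[state] = total
    return max(scores, key=scores.get)

print(predict_state("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 16))
```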

Algorithm: The mathematics and algorithm of the GOR method were based on an earlier series of
studies by Robson and colleagues reported mainly in the Journal of Molecular Biology and The
Biochemical Journal. The latter describes the information theoretic expansions in terms of conditional
information measures. The use of the word "simple" in the title of the GOR paper reflected the fact that
the above earlier methods provided proofs and techniques somewhat daunting by being rather
unfamiliar in protein science in the early 1970s; even Bayes methods were then unfamiliar and
controversial. An important feature of these early studies, which survived in the GOR method, was the
treatment of the sparse protein sequence data of the early 1970s by expected information measures.
That is, expectations on a Bayesian basis considering the distribution of plausible information measure
values given the actual frequencies (numbers of observations). The expectation measures resulting from
integration over this and similar distributions may now be seen as composed of "incomplete" or
extended zeta functions, e.g. z(s, observed frequency) – z(s, expected frequency), with the incomplete zeta function z(s, n) = 1 + (1/2)^s + (1/3)^s + (1/4)^s + ... + (1/n)^s. The GOR method used s = 1. Also, in the
GOR method and the earlier methods, the measure for the contrary state to e.g., helix H, i.e. ~H, was
subtracted from that for H, and similarly for beta sheet, turns, and coil or loop. Thus, the method can be
seen as employing a zeta function estimate of log predictive odds. An adjustable decision constant could
also be applied, which thus implies a decision theory approach; the GOR method allowed the option to
use decision constants to optimize predictions for different classes of protein. The expected information
measure used as a basis for the information expansion was less important by the time of publication of
the GOR method because protein sequence data became more plentiful, at least for the terms considered
at that time. Then, for s = 1, the expression z(s, observed frequency) – z(s, expected frequency)
approaches the natural logarithm of (observed frequency / expected frequency) as frequencies increase.
However, this measure (including use of other values of s) remains important in later more general
applications with high-dimensional data, where data for more complex terms in the information
expansion are inevitably sparse.
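For concreteness, the incomplete zeta function above and its s = 1 limiting behaviour can be checked numerically (a small sketch, not part of the original GOR code):

```python
# Sketch: incomplete zeta z(s, n) = 1 + (1/2)^s + ... + (1/n)^s, and the s = 1
# limit z(1, observed) - z(1, expected) -> ln(observed/expected) for large counts.
import math

def incomplete_zeta(s, n):
    return sum(1.0 / k ** s for k in range(1, n + 1))

obs, expected = 500, 200
print(incomplete_zeta(1, obs) - incomplete_zeta(1, expected))  # ~0.915
print(math.log(obs / expected))                                # ~0.916
```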

14. Explain the applications of bioinformatics in drug discovery and drug designing.


Answer: Bioinformatic techniques hold a lot of promise in target identification (generally proteins/enzymes), target validation, understanding protein evolution and phylogeny, and protein modeling. Bioinformatic analysis can not only accelerate drug target identification and drug candidate screening and refinement, but also facilitate characterization of side effects and prediction of drug resistance.

One of the major thrusts of current bioinformatics approaches is the prediction and identification of
biologically active candidates, and mining and storage of related information. It also provides strategies
and algorithm to predict new drug targets and to store and manage available drug target information.

In molecular docking: Docking is an automated computational procedure that attempts to find the best match between two molecules; in other words, a computational determination of the binding affinity between molecules. This includes determining the orientation of the compound, its conformational geometry, and its score. The score may be a binding energy, a free energy, or a qualitative numerical measure.

Every docking algorithm automatically tries to place the compound in many different orientations and conformations in the active site, and then computes a score for each. Some bioinformatics programs store the data for all of the tested orientations, but most keep only a number of those with the best scores, as in the toy sketch below.
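The loop itself is simple to sketch; the scoring function below is a random stand-in for a real binding-energy estimate, so only the keep-the-best-scores mechanics are illustrative:

```python
# Toy sketch of the docking loop: score many candidate poses, keep the best few.
import heapq
import random

random.seed(1)

def score_pose(pose):
    # Stand-in for a docking score (lower = better binding); real programs use
    # physics- or knowledge-based energy functions here.
    return random.uniform(-12.0, 0.0)

poses = [f"pose_{i}" for i in range(1000)]         # candidate orientations/conformations
scores = {pose: score_pose(pose) for pose in poses}
best = heapq.nsmallest(5, scores, key=scores.get)  # retain only the top-scoring poses
print(best)
```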

Docking can be done using bioinformatics tools which are able to search a database containing
molecular structures and retrieve the molecules that can interact with the query structure.

It also aids in building up chemical and biological information databases about ligands and targets/proteins to identify and optimize novel drugs.

It is involved in devising in silico filters to calculate drug likeness or pharmacokinetic properties for the
chemical compounds prior to screening to enable early detection of the compounds which are more
likely to fail in clinical stages and further to enhance detection of promising entities.

Bioinformatics tools help in the identification of homologs of functional proteins such as motif, protein
families or domains.

It helps in the identification of targets by cross species examination by the use of pairwise or multiple
alignments.

The tools help in the visualization of molecular models.

It allows identifying drug candidates from a large collection of compound libraries by means of virtual
high-throughput screening (VHTS).
Homology modeling is extensively used for active site prediction of candidate drugs.

“In Silico” is an expression used to mean “performed on computer or via computer simulation.”
In Silico drug designing is thus the identification of the drug target molecule by employing
bioinformatics tools.

The inventive process of finding new medications based on the knowledge of a biological target is
called as drug designing. It can be accomplished in two ways:

 Ligand based drug design


Relies on knowledge of other molecules that bind to the biological target of interest.

 Structure-based drug design


o Relies on knowledge of the three-dimensional structure of the biological target
obtained through methods such as homology modeling, NMR spectroscopy, X-
ray crystallography etc.
o Drug discovery process is a critical issue in the pharmaceutical industry since it
is a very costly and time-consuming process to produce new drug potentials and
enlarge the scope of diseases incurred.
o In both methods of designing drugs, computers and various bioinformatics tool
come handy. Thus, in silico drug designing today is very crucial means to allay
the arduous task of manual and experimental designing of drugs.
o In silico technology alone, however, cannot guarantee the identification of new,
safe and effective lead compounds; more realistically, future success depends on
the proper integration of new promising technologies with the experience and
strategies of classical medicinal chemistry.

The Process of Drug Designing


The drug discovery process involves the identification of the lead structure followed by the
synthesis of its analogs, their screening to get candidate molecules for drug development.
In the traditional drug discovery process, the steps include:
1. Identification of the suitable drug target which are biomolecules mainly including
DNA, RNA and proteins (such as receptors, transporters, enzymes and ion channels).
2. Validation of such targets is necessary to exhibit a sufficient level of ‘confidence’ and
to know their pharmacological relevance to the disease under investigation. This can be
performed from very basic levels such as cellular, molecular levels to the whole animal
level.
3. Identification of effective compounds such as inhibitors, modulators or
antagonists for such target is called lead identification where the design and
development of a suitable assay is done to monitor the effect on the target under study.
4. Compounds showing dose-dependent target modulation in terms of a certain degree of
confidence are processed further as lead compounds.
5. Subsequently, the experiments are performed on the animal models in the
laboratories and the positive results are then optimized in terms of potency and
selectivity.
6. The physicochemical properties, pharmacokinetic behaviour and safety
features are also assessed before compounds become candidates for drug development.

Even though most of the processes depend on experimental tasks, in silico approaches are
playing important roles in every stage of this drug discovery pipeline which are described
below:

In silico Methods in Drug Discovery and the role of Bioinformatics


 In silico drug design represents computational methods and resources that are used to
facilitate the opportunities for future drug lead discovery.
 The explosion of bioinformatics, cheminformatics, genomics, proteomics, and
structural information has provided hundreds of new targets as well as new ligands.

15. Discuss the use of RASMOL program.


Answer: RasMol is a molecular graphics program intended for the visualization of proteins, nucleic acids and small molecules. The program is aimed at display, teaching and the generation of publication-quality images. RasMol runs on a wide range of architectures and operating
systems including SGI, sun4, sun3, sun386i, DEC, HP and E&S workstations, DEC Alpha
(OSF/1, Open VMS and Windows NT), IBM RS/6000, Cray, Sequent, VAX VMS (under DEC
windows), IBM PC (under Microsoft Windows, Windows NT, OS/2, Linux, BSD386 and
*BSD), Apple Macintosh and PowerMac. UNIX and VMS versions require an 8bit, 24bit or
32bit X Windows frame buffer (X11R4 or later). The X Windows version of RasMol provides
optional support for a hardware dials box and accelerated shared memory communication (via
the XInput and MIT- SHM extensions) if available on the current X Server.

The program reads in molecular co-ordinate files and interactively displays the molecule on
the screen in a variety of representations and colour schemes. Supported input file formats
include Brookhaven Protein Databank (PDB), Tripos Associates' Alchemy and Sybyl Mol2
formats, Molecular Design Limited's (MDL) Mol file format, Minnesota Supercomputer
Centre's (MSC) XYZ (XMol) format and CHARMm format files. If connectivity information
is not contained in the file this is calculated automatically. The loaded molecule can be shown
as wireframe bonds, cylinder 'Dreiding' stick bonds, alpha-carbon trace, space-filling (CPK)
spheres, macromolecular ribbons (either smooth shaded solid ribbons or parallel strands),
hydrogen bonding and dot surface representations. Different parts of the molecule may be
represented and coloured independently of the rest of the molecule or displayed in several
representations simultaneously. The displayed molecule may be rotated, translated, zoomed
and z-clipped (slabbed) interactively using either the mouse, the scroll bars, the command line
or an attached dial box. RasMol can read a prepared list of commands from a 'script' file (or
via inter-process communication) to allow a given image or viewpoint to be restored quickly.
RasMol can also create a script file containing the commands required to regenerate the current
image. Finally, the rendered image may be written out in a variety of formats including either
raster or vector PostScript, GIF, PPM, BMP, PICT, Sun rasterfile or as a MolScript input script
or a Kinemage.
RasMol has been developed at the University of Edinburgh's Biocomputing Research Unit and
the BioMolecular Structure Department, Glaxo Research and Development, Greenford, U.K.
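As a small illustration of the script facility described above, RasMol can be driven non-interactively from Python; the command names follow the RasMol manual, while the file names and the '-script' invocation are assumptions for this sketch:

```python
# Sketch: writing a RasMol command script and replaying it to restore a viewpoint.
import subprocess

commands = """\
load pdb 1crn.pdb
wireframe off
cartoons
colour structure
"""

with open("view.spt", "w") as f:
    f.write(commands)  # a prepared list of RasMol commands

subprocess.run(["rasmol", "-script", "view.spt"])  # assumes rasmol is on PATH
```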

Substitution matrix
Substitution matrices such as BLOSUM are used for sequence alignment of proteins. A
Substitution matrix assigns a score for aligning any possible pair of residues. In general,
different substitution matrices are tailored to detecting similarities among sequences that are
diverged by differing degrees. A single matrix may be reasonably efficient over a relatively
broad range of evolutionary change. The BLOSUM-62 matrix is one of the best substitution
matrices for detecting weak protein similarities. BLOSUM matrices with high numbers are
designed for comparing closely related sequences, while those with low numbers are designed
for comparing distant related sequences. For example, BLOSUM-80 is used for alignments that
are more similar in sequence, and BLOSUM-45 is used for alignments that have diverged from
each other. For particularly long and weak alignments, the BLOSUM-45 matrix may provide
the best results. Short alignments are more easily detected using a matrix with a higher "relative
entropy" than that of BLOSUM-62. The BLOSUM series does not include any matrices with
relative entropies suitable for the shortest queries.
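As a short illustration, BLOSUM62 scores can be looked up and summed over a gapless alignment with Biopython (assuming Biopython >= 1.75, where Bio.Align.substitution_matrices is available):

```python
# Sketch: pairwise residue scores and a gapless alignment score with BLOSUM62.
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
print(blosum62["W", "W"])  # identical tryptophans score highly (11.0)
print(blosum62["W", "A"])  # a tryptophan-to-alanine substitution is penalised (-3.0)

def alignment_score(seq_a, seq_b):
    """Sum BLOSUM62 scores over a gapless alignment of equal-length sequences."""
    return sum(blosum62[a, b] for a, b in zip(seq_a, seq_b))

print(alignment_score("HEAGAWGHEE", "PAWHEAEAWG"))
```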

Difference between Cladogram and Phylogenetic Tree

Nature: A cladogram is not an evolutionary tree; therefore, it does not show evolutionary relationships. A phylogenetic tree is an evolutionary tree; it shows evolutionary relationships.

Usage: A cladogram represents a hypothesis about the actual evolutionary history of a group. A phylogenetic tree represents the true evolutionary history of organisms.

Length of the branches: A cladogram is drawn with branches of equal length; branch length does not represent evolutionary distance. In a phylogenetic tree, branch length indicates the evolutionary distance.

Indication of evolutionary time: A cladogram does not indicate the amount of evolutionary time separating the organisms' taxa. A phylogenetic tree indicates the amount of evolutionary time separating the organisms' taxa.
