Question Bank (Bioinformatics I)
Unit 1
Applications of Bioinformatics
Bioinformatics and its applications depend on extracting useful facts and figures from stored collections of biological data and processing them into useful information. Its scope covers areas such as 3D image processing, 3D modelling of living cells, image analysis, drug
development, and much more. The most important application of bioinformatics can be seen in
the field of medicine, where its data are used to develop treatments for infectious and other harmful
diseases.
Broadly, bioinformatics aims to make information about natural biological processes easier to use and less complicated to analyze. Some examples of the application of bioinformatics are as follows.
Bioinformatics is largely used in gene therapy.
It finds application in evolutionary studies.
Microbial analysis and computing.
Understanding protein structure and modeling.
Storage and retrieval of biotechnological data.
In the discovery of new drugs.
In agriculture to understand crop patterns, pest control, and crop management.
Scope of Bioinformatics
The main scope of Bioinformatics is to fetch all the relevant data and process it into useful
information. It also deals with –
Management and analysis of a wide set of biological data.
It is especially used in human genome sequencing, where large sets of data are
handled.
Bioinformatics plays a major role in the research and development of the biomedical
field.
Bioinformatics uses computational coding for several applications that involve finding
gene and protein functions and sequences, developing evolutionary relationships, and
analyzing the three-dimensional shapes of proteins.
Research works based on genetic disease and microbial disease entirely depend on
bioinformatics, where the derived information can be vital to produce personalized
medicines.
DDBJ (Japan), GenBank (USA) and European Nucleotide Archive (Europe) are repositories
for nucleotide sequence data from all organisms. All three accept nucleotide sequence
submissions, and then exchange new and updated data on a daily basis to achieve optimal
synchronization between them.
4. Write two names of protein databases.
Answer: A protein database is a collection of data that has been constructed from physical,
chemical and biological information on sequence, domain structure, function, three‐
dimensional structure and protein‐protein interactions. Collectively, protein databases may
form a protein sequence database.
The PRIMARY databases hold experimentally determined protein sequences as well as sequences inferred from the conceptual translation of nucleotide sequences. The latter, of course, are not experimentally derived information, but have arisen as a result of interpretation of the nucleotide sequence information and consequently must be treated as potentially containing misinterpreted information. There are a number of primary protein sequence databases, and each requires some specific consideration.
a. Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD):
The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich
Information Centre for Protein Sequences, Germany) and the JIPID (Japan International
Protein Information Database, Japan).
The PIR-PSD is now a comprehensive, non-redundant, expertly annotated, object-
relational DBMS.
A unique characteristic of the PIR-PSD is its classification of protein sequences based
on the superfamily concept.
The sequence in PIR-PSD is also classified based on homology domain and sequence
motifs.
Homology domains may correspond to evolutionary building blocks, while sequence
motifs represent functional sites or conserved regions.
The classification approach allows a more complete understanding of the sequence-structure-function relationship.
b. SWISS-PROT
The other well known and extensively used protein database is SWISS-PROT. Like
the PIR-PSD, this curated protein sequence database also provides a high level of
annotation.
The data in each entry can be considered separately as core data and annotation.
The core data consists of the sequences entered in common single letter amino acid
code, and the related references and bibliography. The taxonomy of the organism from
which the sequence was obtained also forms part of this core information.
The annotation contains information on the function or functions of the protein; post-translational modifications such as phosphorylation, acetylation, etc.; functional and structural domains and sites, such as calcium-binding regions, ATP-binding sites, zinc fingers, etc.; known secondary structural features, for example alpha helices and beta sheets; the quaternary structure of the protein; similarities to other proteins, if any; and diseases associated with deficiencies or abnormalities of the protein. Conflicts arising from different authors publishing different sequences for the same protein, or from mutations in different strains of an organism, are also described as part of the annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is
released as a supplement to SWISS-PROT. It contains the translation of all coding sequences
present in the EMBL Nucleotide database, which have not been fully annotated. Thus, it may
contain the sequence of proteins that are never expressed and never actually identified in the
organisms.
The secondary databases are so termed because they contain the results of analysis of the
sequences held in primary databases. Many secondary protein databases are the result of
looking for features that relate different proteins. Some commonly used secondary databases
of sequence and structure are as follows:
a. PROSITE:
A set of databases collects together patterns found in protein sequences rather than the
complete sequences. PROSITE is one such pattern database.
The protein motif and pattern are encoded as “regular expressions”.
The information corresponding to each entry in PROSITE is of two forms – the
patterns and the related descriptive text.
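As a small illustration of the point above, a PROSITE-style pattern can be translated into an ordinary regular expression and scanned against a sequence. The sketch below is a minimal Python example: the translation function handles only the pattern elements used here, the pattern is the P-loop (ATP/GTP-binding site) motif as commonly written in PROSITE syntax, and the test sequence is just a fragment chosen to contain one match.

```python
import re

# Translate a simple PROSITE-style pattern into a Python regular expression.
# Only the elements used in this example are handled ([..] alternatives, x and
# x(n) wildcards); real PROSITE syntax has more features.
def prosite_to_regex(pattern):
    parts = []
    for element in pattern.split("-"):
        if element.startswith("x"):
            count = element[2:-1] if "(" in element else "1"
            parts.append("." if count == "1" else ".{%s}" % count)
        else:
            parts.append(element)        # e.g. "G", "K", "[AG]", "[ST]"
    return "".join(parts)

# P-loop (ATP/GTP-binding site) motif written in PROSITE syntax
ploop = "[AG]-x(4)-G-K-[ST]"
# Toy sequence fragment chosen so that it contains one match
seq = "MTEYKLVVVGAGGVGKSALTIQ"
for hit in re.finditer(prosite_to_regex(ploop), seq):
    print(hit.start(), hit.group())
```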
b. PRINTS:
In the PRINTS database, the protein sequence patterns are stored as ‘fingerprints’. A
fingerprint is a set of motifs or patterns rather than a single one.
The information contained in the PRINT entry may be divided into three sections. In
addition to entry name, accession number and number of motifs, the first section
contains cross-links to other databases that have more information about the
characterized family.
The second section provides a table showing how many of the motifs that make up the
fingerprint occur in how many of the sequences in that family.
The last section of the entry contains the actual fingerprints, which are stored as multiple
aligned sets of sequences; the alignment is made without gaps. There is, therefore, one
set of aligned sequences for each motif.
c. MHCPep:
MHCPep is a database comprising over 13000 peptide sequences known to bind the
Major Histocompatibility Complex of the immune system.
Each entry in the database contains not only the peptide sequence, which may be 8 to
10 amino acids long, but also information on the specific MHC molecules to which it binds, the experimental method used to assay the peptide, the degree of activity and binding affinity observed, the source protein that, when broken down, gave rise to the peptide, the positions along the peptide where it anchors on the MHC molecules, and references and cross-links to other information.
d. Pfam
Pfam contains protein family profiles built using hidden Markov models (HMMs).
HMMs model the pattern as a series of match, insert, or delete states, with scores
assigned for moving from one state to another during alignment.
Each family or pattern defined in Pfam consists of four elements. The first is
the annotation, which has information on the source used to make the entry, the method
used, and some numbers that serve as figures of merit.
The second is the seed alignment that is used to bootstrap the rest of the sequences
into the multiple alignments and then the family.
The third is the HMM profile.
The fourth element is the complete alignment of all the sequences identified in that
family.
It is purely a system for accessing and searching the databases at NCBI; it cannot be set up on
one's own server. ENTREZ has access to all NCBI databases, such as PubMed, Books, OMIM,
OMIA, Gene, Genome, Nucleotide, etc.
NCBI: To maintain biological data repositories and provide biocomputing services, we must
mention the leading American information provider, the National Center for Biotechnology
Information (NCBI). The NCBI was established in 1988 as a division of the National Library
of Medicine (NLM) and is located on the campus of the National Institutes of Health (NIH).
The NLM was chosen to host NCBI because of its experience in biomedical database
maintenance, and because, as part of the NIH, it could establish a research program in
computational biology. The role of the NCBI is to develop new information technologies to aid
our understanding of the molecular and genetic processes that underlie health and disease. Its
specific aims include the creation of automated systems for storing and analyzing biological
information; the development of advanced methods of computer-based information processing;
the facilitation of user access to databases and software; and the co-ordination of efforts to
gather biotechnology information worldwide. Since 1992, one of the principal tasks of the
NCBI has been the maintenance of GenBank, the NIH DNA sequence database. Groups of
annotators create sequence data records from the scientific literature and, together with
information acquired directly from authors, data are exchanged with the international
nucleotide databases, EMBL and DDBJ. At the NCBI the ENTREZ facility (a search tool of
NCBI) was developed to allow retrieval of molecular biology data and bibliographic citations
from NCBI’s integrated databases.
9. How are biological databases categorised on the type of information they contain?
Give example of each category of database.
Answer: Primary databases: contain original biological data. They are archives of raw
sequence or structural data submitted by the scientific community, e.g. GenBank and the Protein Data Bank (PDB).
Secondary databases: contain information derived from the analysis of primary data, e.g. PROSITE, PRINTS, and Pfam.
Specialized databases: cater to a particular research interest. E.g. FlyBase, the HIV sequence
database, and the Ribosomal Database Project specialize in a particular organism or a particular type
of data.
10. What information would you obtain from the following databases.
i. UniProt
ii. dbSTS
iii. UniGene
iv. dbHTGS
Answer: UniProt: provides highly curated protein sequence and function information.
dbSTS: database of sequence-tagged sites; STSs are short (200 to 500 base pair) DNA sequences
that occur once in the genome and whose location and base sequence are known.
UniGene: UniGene computationally identifies transcripts from the same locus; analyzes
expression by tissue, age, and health status; and reports related proteins (protEST) and clone
resources.
dbHTGS: database of high-throughput genomic sequences, i.e. unfinished, large-scale genomic sequence records produced by genome sequencing centres.
BLAST is an open source program and anyone can download and change the program code.
This has also given rise to a number of BLAST derivatives; WU-BLAST is probably the most
commonly used [Altschul and Gish, 1996]. BLAST is highly scalable and comes in a number
of different computer platform configurations which makes usage on both small desktop
computers and large computer clusters possible.
Examples of BLAST usage
BLAST can be used for many different purposes. A few of them are mentioned below.
Looking for species. If you are sequencing DNA from unknown species, BLAST may
help identify the correct species or homologous species.
Looking for domains. If you BLAST a protein sequence (or a translated nucleotide
sequence) BLAST will look for known domains in the query sequence.
Looking at phylogeny. You can use the BLAST web pages to generate a phylogenetic
tree of the BLAST result.
Mapping DNA to a known chromosome. If you are sequencing a gene from a known
species but have no idea of the chromosome location, BLAST can help you. BLAST
will show you the position of the query sequence in relation to the hit sequences.
Annotations. BLAST can also be used to map annotations from one organism to another
or look for common genes in two related species.
Searching for homology
Most research projects involving sequencing of either DNA or protein have a requirement for
obtaining biological information of the newly sequenced and maybe unknown sequence. If the
researchers have no prior information of the sequence and biological content, valuable
information can often be obtained using BLAST. The BLAST algorithm will search for
homologous sequences in predefined and annotated databases of the user’s choice. In an easy
and fast way, the researcher can gain knowledge of gene or protein function and find
evolutionary relations between the newly sequenced DNA and well-established data. After the
BLAST search, the user receives a report listing the homologous sequences found and their
local alignments to the query sequence.
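As an example of running such a search from a script, Biopython exposes NCBI's web BLAST service. The sketch below assumes Biopython is installed and the machine has internet access; the query sequence is only a placeholder.

```python
# Submitting a remote BLAST search through NCBI's web service with Biopython.
# Assumes Biopython is installed and internet access is available;
# the query sequence below is only a placeholder.
from Bio.Blast import NCBIWWW, NCBIXML

query = "AGCTAGCTAGGTACGATCGATCGTAGCTAGCTAGCTAGGATCG"   # hypothetical query
result_handle = NCBIWWW.qblast("blastn", "nt", query)     # nucleotide query vs. nt database

record = NCBIXML.read(result_handle)
for alignment in record.alignments[:5]:                    # report the top five hits
    hsp = alignment.hsps[0]
    print(alignment.title)
    print("E-value:", hsp.expect, "Score:", hsp.score)
```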
Variants of BLAST
BLAST-N: compares nucleotide sequence with nucleotide sequences
BLAST-P: compares protein sequences with protein sequences
BLAST-X: Compares nucleotide sequences against the protein sequences
tBLAST-N: compares the protein sequences against the six frame translations of
nucleotide sequences
tBLAST-X: Compares the six-frame translations of a nucleotide sequence against the six-frame translations of nucleotide sequences in the database.
It was the first database similarity search tool developed, preceding the development of
BLAST. FASTA is another sequence alignment tool which is used to search similarities
between sequences of DNA and proteins. FASTA uses a “hashing” strategy to find matches
for a short stretch of identical residues with a length of k. The string of residues is known as
ktuples or ktups, which are equivalent to words in BLAST, but are normally shorter than the
words. Typically, a ktup is composed of two residues for protein sequences and six residues
for DNA sequences. The query sequence is thus broken down into sequence patterns or words
known as k-tuples and the target sequences are searched for these k-tuples in order to find the
similarities between the two. FASTA is a fine tool for similarity searches.
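A minimal sketch of the k-tuple lookup idea described above (illustrative only, not the actual FASTA implementation): every k-tuple of the target sequence is indexed in a hash table, the query's k-tuples are looked up, and hits falling on the same diagonal (query offset minus target offset) point to a candidate region of similarity that FASTA would then extend and rescore.

```python
from collections import defaultdict

def ktup_diagonals(query, target, k=6):          # k = 6 is typical for DNA
    # Index every k-tuple of the target ("hashing" step)
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    # Look up the query's k-tuples and count word hits per diagonal
    diagonals = defaultdict(int)
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            diagonals[j - i] += 1
    return diagonals

query = "ACGTACGTTTGACGTACGA"
target = "TTACGTACGTTTGACGTACGATT"
# Print the three diagonals with the most k-tuple hits
print(sorted(ktup_diagonals(query, target).items(), key=lambda kv: -kv[1])[:3])
```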
3. Differentiate between local and global alignment. Which alignment would you
prefer for highly similar sequences?
Answer: Global alignment
Sequences are assumed to be generally similar over their entire length; alignment is
carried out to find the best possible alignment over the entire length of the sequences.
The method works best for closely related sequences of approximately the same length
and works less well for divergent sequences or sequences of very different lengths.
Local alignment
Looks for local regions of similarity and aligns these regions without regard for the rest of
the sequence.
The method is good for aligning more divergent sequences with the goal of searching for
conserved patterns in DNA or protein sequences.
For highly similar sequences of approximately the same length, global alignment is preferred.
Definition
BLAST: BLAST is an algorithm for comparing primary biological sequence information like
nucleotide or amino acid sequences.
FASTA: FASTA is a DNA and protein sequence alignment software package.
Stands for
BLAST: BLAST stands for Basic Local Alignment Search Tool.
FASTA: FASTA is short for “fast-all” or “FastA”.
Global/Local Alignment
BLAST: BLAST uses local sequence alignment.
FASTA: FASTA uses local sequence alignment first and then it extends the similarity search
to global alignment.
Local Sequence Alignment
BLAST: BLAST searches similarities in local alignment by comparing individual residues in
the two sequences.
FASTA: FASTA searches similarities in local alignments by comparing sequence patterns or
words.
Type of Search
BLAST: BLAST is better for similarity searching in closely matched or locally optimal
sequences.
FASTA: FASTA is better for similarity searching in less similar sequences.
Type of Work
BLAST: BLAST works best for protein searches.
FASTA: FASTA works best for nucleotide searches.
Gaps in Query Sequence
BLAST: In the original BLAST algorithm, gaps between query and target sequences are not allowed (gapped alignment was introduced in later versions).
FASTA: In FASTA, gaps are allowed.
Sensitivity
BLAST: BLAST is a sensitive bioinformatics tool.
FASTA: FASTA is more sensitive than BLAST.
Speed
BLAST: BLAST is speedier than FASTA.
FASTA: FASTA is a slower tool when compared to BLAST.
Developers
BLAST: BLAST was designed by Stephen Altschul, Webb Miller, Warren Gish, Eugene
Myers and David J. Lipman at the National Institutes of Health in 1990.
FASTA: FASTA was developed by David J. Lipman and William R. Pearson in 1985.
Significance
BLAST: At present, BLAST is the most widely used bioinformatics tool for similarity
searches.
FASTA: The legacy of FASTA is the FASTA format, which is now ubiquitous in
bioinformatics.
6. Differentiate primary and secondary database.
Answer: A database is a collection of related data arranged in a way suitable for adding, locating,
removing and modifying the data.
I. Primary database
1. It is also known as archival database
2. Databases consisting of data derived experimentally such as nucleotide sequences and
three-dimensional structures are known as primary databases.
3. It contains original experimental results that are directly submitted to the database by
researchers across the globe.
4. A primary database has high levels of redundancy or duplication of data.
5. Examples: GenBank, DDBJ, PDB.
9. For protein alignment, the situation is far more complex, support your
statement.
Answer:
Although there are twice as many possible transversions, because of the molecular
mechanisms by which they are generated, transition mutations are generated at higher
frequency than transversions. As well, transitions are less likely to result in amino acid
substitutions (due to "wobble") and are therefore more likely to persist as "silent substitutions"
in populations as single nucleotide polymorphisms (SNPs).
The values in the table represent log odds scores of the residues calculated from the multiple
alignment. To construct a matrix, raw frequencies of each residue at each column position from
a multiple alignment are first counted. The frequencies are normalized by dividing positional
frequencies of each residue by overall frequencies so that the scores are length and composition
independent. The normalized values are then converted to log odds scores by taking the logarithm
(normally to base 2). In this way, the matrix values become log odds scores of residues
occurring at each alignment position. In this matrix, a positive score represents an identical or
similar residue match; a negative score represents a non-conserved match.
This constructed matrix can be considered a distilled representation for the entire group of
related sequences, providing a quantitative description of the degree of sequence conservation
at each position of a multiple alignment. The probabilistic model can then be used like a single
sequence for database searching and alignment or can be used to test how well a particular
target sequence fits into the sequence group. For example, given the matrix shown in Figure
1,which is derived from a DNA multiple alignment, one can ask the question, how well does
the new sequence AACTCG fit into the matrix? To answer the question, the probability values
of the sequence at respective positions of the matrix can be added up to produce the sum of the
scores (Fig. 2).
Fig. 2. Example of calculation of how well a new sequence fits into the PSSM produced in Fig. 1. The matching
positions for the new sequence AACTCG are circled in the matrix.
In this case, the total match score for the sequence is 6.33. Because the matrix
values are logarithms to base 2, the score can be interpreted as the sequence fitting the matrix
2^6.33 times (about 80 times) more likely than by random chance. Consequently, the new
sequence can be confidently classified as a member of the sequence family.
The probability values in a PSSM depend on the number of sequences used to compile the
matrix. Because the matrix is often constructed from the alignment of an insufficient number
of closely related sequences, to increase the predictive power of the model, a weighting scheme
similar to the one used in the Clustal algorithm is used that down weights overrepresented,
closely related sequences and upweights underrepresented and divergent ones, so that more
divergent sequences can be included. Application of such a weighting scheme makes the matrix
less biased and able to detect more distantly related sequences.
12. Why is sequence comparison important?
Answer: Sequence alignment is the procedure of comparing two (pair-wise alignment) or more
multiple sequences by searching for a series of individual characters or patterns that are in the
same order in the sequences.
Pair-wise alignment: compare two sequences
Multiple sequence alignment: compare more than two sequences
Local alignment concentrates on finding stretches of sequences with high level of matches
Finds local regions with the highest level of similarity between the two sequences and
aligns these regions without considering the alignment of rest of the sequence regions
Suitable for aligning more divergent sequences
Used for finding out conserved patterns in DNA or protein sequences
13. Align the following two sequences using the Needleman-Wunsch method of pairwise
sequence alignment, using -2 as the gap penalty, -3 as the mismatch penalty, and 2 as
the score for a match.
ACTGATTCA
ACGCATCA
Answer:
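A minimal dynamic-programming sketch of the Needleman-Wunsch algorithm with the stated parameters (match +2, mismatch -3, gap -2); running it prints the optimal global alignment score and one optimal alignment of the two sequences.

```python
# Needleman-Wunsch global alignment (sketch) with the scores from the question:
# match = +2, mismatch = -3, gap = -2.
def needleman_wunsch(a, b, match=2, mismatch=-3, gap=-2):
    n, m = len(a), len(b)
    # DP matrix of optimal scores for prefixes a[:i], b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment
    align_a, align_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            align_a.append(a[i - 1]); align_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            align_a.append(a[i - 1]); align_b.append('-'); i -= 1
        else:
            align_a.append('-'); align_b.append(b[j - 1]); j -= 1
    return score[n][m], ''.join(reversed(align_a)), ''.join(reversed(align_b))

s, top, bottom = needleman_wunsch("ACTGATTCA", "ACGCATCA")
print(s)
print(top)
print(bottom)
```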
Types of BLAST
Nucleotide-nucleotide BLAST (blastn) – This program, given a DNA query, returns the most
similar DNA sequences from the DNA database that the user specifies.
Protein-protein BLAST (blastp) – This program, given a protein query, returns the most
similar protein sequences from the protein database that the user specifies.
Position-Specific Iterative BLAST (PSI- BLAST) (blastpgp) – This program is used to find
distant relatives of a protein.
Large numbers of query sequences (megablast) - When comparing large numbers of input
sequences via the command-line BLAST, “megablast” is much faster than running BLAST
multiple times.
Of these programs, BLASTn and BLASTp are the most commonly used because they use direct
comparisons, and do not require translations. However, since protein sequences are better
conserved evolutionarily than nucleotide sequences, tBLASTn, tBLASTx, and BLASTx,
produce more reliable and accurate results when dealing with coding DNA.
18. Consider the two following alignments to compute best sequence match:
V I T K LGT CV GS V I T K LGT CV G S
V T K G T C V S V I T T CV GS
Match = 3 Gap open penalty = -4 Gap extension penalty = -1
Unit 3
Scoring Matrices: Basic concept of a scoring matrix, Similarity and distance matrix,
Substitution matrices: Matrices for nucleic acid and proteins sequences, PAM and
BLOSUM series, Principles based on which these matrices are derived and Gap Penalty;
Predictive Method using Nucleotide Sequence: Introduction, Marking repetitive DNA,
Database search, Codon bias detection, detecting functional site in DNA.
Similarity matrices are used in sequence alignment. Higher scores are given to more-similar
characters, and lower or negative scores for dissimilar characters.
Nucleotide similarity matrices are used to align nucleic acid sequences. Because there are
only four nucleotides commonly found in DNA (Adenine (A), Cytosine (C), Guanine (G)
and Thymine (T)), nucleotide similarity matrices are much simpler than protein similarity
matrices. For example, a simple matrix will assign identical bases a score of +1 and non-
identical bases a score of −1. A more complicated matrix would give a higher score to
transitions (changes from a pyrimidine such as C or T to another pyrimidine, or from a
purine such as A or G to another purine) than to transversions (from a pyrimidine to a purine
or vice versa). The match/mismatch ratio of the matrix sets the target evolutionary distance
(States et al. 1991 METHODS - A companion to Methods in Enzymology 3:66-70); the
+1/−3 DNA matrix used by BLASTN is best suited for finding matches between sequences
that are 99% identical; a +1/−1 (or +4/−4) matrix is much more sensitive as it is optimal
for matches between sequences that are about 70% identical.
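A small sketch of such a transition/transversion-aware scoring scheme; the numerical values are illustrative rather than those of any particular program.

```python
# Identities score highest; transitions (A<->G, C<->T) score higher than transversions.
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def pair_score(x, y, match=2, transition=-1, transversion=-3):
    if x == y:
        return match
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return transition if same_class else transversion

def score_ungapped(a, b):
    # Score an ungapped alignment of two equal-length sequences
    return sum(pair_score(x, y) for x, y in zip(a, b))

print(score_ungapped("ACGT", "ACAT"))   # G->A is a transition
```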
Amino acid similarity matrices are more complicated, because there are 20 amino acids
coded for by the genetic code. Therefore, the similarity matrix for amino acids contains 400
entries (although it is usually symmetric). The first approach scored all amino acid changes
equally. A later refinement was to determine amino acid similarities based on how many
base changes were required to change a codon to code for that amino acid. This model is
better, but it doesn't take into account the selective pressure of amino acid changes. Better
models took into account the chemical properties of amino acids.
One approach has been to empirically generate the similarity matrices. The Dayhoff method
used phylogenetic trees and sequences taken from species on the tree. This approach has
given rise to the PAM series of matrices. PAM matrices are labelled based on how many
accepted point mutations (amino acid substitutions) have occurred per 100 residues. While the PAM matrices benefit
from having a well understood evolutionary model, they are most useful at short
evolutionary distances (PAM10 - PAM120). At long evolutionary distances, for example
PAM250 or 20% identity, it has been shown that the BLOSUM matrices are much more
effective.
The BLOSUM series were generated by comparing a number of divergent sequences. The
BLOSUM series are labeled based on the minimum percentage identity of the sequence blocks used
to construct each matrix (for example, BLOSUM62 is built from blocks clustered at 62% identity),
so a lower BLOSUM number corresponds to a higher PAM number, i.e. a greater evolutionary distance.
Computational algorithms are used to produce and analyse the MSAs due to the difficulty
and intractability of manually processing the sequences given their biologically relevant
length. MSAs require more sophisticated methodologies than pairwise alignment because
they are more computationally complex. Most multiple sequence alignment programs
use heuristic methods rather than global optimization because identifying the optimal
alignment between more than a few sequences of moderate length is prohibitively
computationally expensive. On the other hand, heuristic methods generally fail to give
guarantees on the solution quality, with heuristic solutions shown to be often far below the
optimal solution on benchmark instances.
4. Describe Gap Penalty. Why are gap penalties introduced during sequence
alignment? What are affine gap penalties?
Answer: A Gap penalty is a method of scoring alignments of two or more sequences. Gap
penalties are used during sequence alignment. Gap penalties contribute to the overall score
of alignments, and therefore, the size of the gap penalty relative to the entries in
the similarity matrix affects the alignment that is finally selected. Selecting a higher gap
penalty will cause less favourable characters to be aligned, to avoid creating as many gaps
as possible. When aligning sequences, introducing gaps in the sequences can allow an
alignment algorithm to match more terms than a gap-less alignment can. However,
minimizing gaps in an alignment is important to create a useful alignment. Too many gaps
can cause an alignment to become meaningless. Gap penalties are used to adjust alignment
scores based on the number and length of gaps. The five main types of gap penalties are
constant, linear, affine, convex, and profile based.
Gap penalties increase the quality of an alignment by ensuring that non-homologous regions are not
aligned. Permitting the insertion of arbitrarily many gaps might lead to high-scoring
alignments of non-homologous sequences; penalizing gaps forces alignments to have
relatively few gaps. An affine gap penalty has both a constant and a proportional contribution:
the constant part is independent of the length of the gap, while the proportional part grows with
the length of the gap. An affine gap penalty therefore includes a penalty for opening a gap as well as
for extending the gap.
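A short sketch of how an affine gap penalty is typically computed, under the common convention that the opening cost is charged for the first gapped position and the extension cost for each additional position; the values are illustrative only.

```python
# Affine gap penalty: a fixed cost for opening a gap plus a smaller cost for
# each additional gapped position (illustrative values).
def affine_gap_penalty(gap_length, gap_open=-10, gap_extend=-1):
    if gap_length == 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend

for g in (1, 2, 5):
    print(g, affine_gap_penalty(g))   # longer gaps are penalized, but sub-linearly
```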
RefSeq is limited to major organisms for which sufficient data are available (121,461
distinct "named" organisms as of July 2022), while GenBank includes sequences for
any organism submitted (approximately 504,000 formally described species).
10. Differentiate PAM and BLOSUM. Which PAM matrix would you use to align
highly similar sequences?
Answer: Substitution matrices are used to score aligned positions in a sequence
alignment procedure, usually of amino acids or nucleotide sequences. In addition to BLOSUM
matrices, a previously developed scoring matrix can be used for sequence similarity analysis.
This is known as a PAM matrix. The two result in similar scoring outcomes but use differing
methodologies: BLOSUM looks directly at substitutions in conserved blocks (motifs) of related sequences, while
PAM matrices extrapolate evolutionary information from closely related sequences.
Since both PAM and BLOSUM are different methods for expressing the same kind of scoring
information, the two can be compared, but because of the very different methods of obtaining the
scores, a PAM100 does not equal a BLOSUM100. For aligning highly similar sequences, a low-numbered
PAM matrix (for example PAM1 to PAM30) is appropriate, since low PAM numbers correspond to short
evolutionary distances.
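For reference, standard PAM and BLOSUM matrices can be inspected directly, for example with Biopython; the sketch assumes a recent Biopython release in which Bio.Align.substitution_matrices is available.

```python
# Loading standard substitution matrices with Biopython.
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
pam30 = substitution_matrices.load("PAM30")      # suited to very similar sequences

print(blosum62["W", "W"], blosum62["W", "A"])    # identical vs. dissimilar residue pair
print(pam30["L", "L"], pam30["L", "I"])
```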
11. Compute the matching position of nucleotides using PSSM for the given
sequences:
Seq1: ATGTCG, Seq2: AAGACT, Seq3: TACTCA, Seq4: CGGAGG, and Seq5:
AACCTG
Figure 1. Example of construction of a PSSM from a multiple alignment of nucleotide sequences. The process
involves counting raw frequencies of each nucleotide at each column position, normalization of the frequencies
by dividing positional frequencies of each nucleotide by overall frequencies and converting the values to log odds
scores.
Figure 2. Example of calculation of how well a new sequence fits into the PSSM produced in Figure 1. The
matching positions for the new sequence AACTCG are circled in the matrix.
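A minimal sketch of the PSSM calculation for the five sequences in this question, assuming equal background frequencies of 0.25 and a small pseudocount so that unobserved bases do not produce log(0); the figures referred to above may use slightly different conventions, so the exact numbers can differ.

```python
import math

# Build a simple PSSM from the five aligned sequences in the question and
# score a new sequence against it. Equal background frequencies (0.25) and a
# pseudocount of 0.5 are assumptions made for this sketch.
sequences = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]
bases = "ACGT"
length, n_seqs = len(sequences[0]), len(sequences)
background, pseudocount = 0.25, 0.5

pssm = []
for col in range(length):
    column = [s[col] for s in sequences]
    scores = {}
    for b in bases:
        freq = (column.count(b) + pseudocount) / (n_seqs + 4 * pseudocount)
        scores[b] = math.log2(freq / background)        # log odds score
    pssm.append(scores)

def score(seq):
    # Sum the log odds scores of the matching positions
    return sum(pssm[i][b] for i, b in enumerate(seq))

print(round(score("AACTCG"), 2))    # the new sequence from Figure 2
```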
Unit 4
Cladogram is a tree-like diagram which is drawn using lines. The nodes of a cladogram
represent the splitting of two groups from a common ancestor. Clades are summarized at
the ends of the lines and the members of a particular clade share similar characteristics.
Clades are built using molecular differences instead of morphological characteristics.
However, cladograms can also be constructed using appropriate morphological and behavioral
data.
Figure. A Primate Cladogram
Figure. A dendrogram
A more accurate estimation of the evolutionary relationship among species is now possible
in a molecular phylogenetic analysis using gene sequencing data. Also, the Linnaean
classification (based on relatedness in obvious physical traits) of newly evolved species can
be done using molecular phylogenetic analysis.
Phylogenetic analysis can be useful in comparative genomics, which studies the relationship
between genomes of different species. In this context, one major application is gene
prediction or gene finding, which means locating specific genetic regions along a genome.
Taxonomy does not reveal anything about the shared evolutionary history of organisms.
Phylogeny reveals the shared evolutionary history.
A different type of clock is used, a Molecular Clock. Instead of measuring seconds, minutes,
and hours, the molecular clock measures the constant rate of change in an organism's genome
(DNA or protein sequences of a specific gene) over time. This constant rate of change occurs
randomly among different species of organisms such as animals, plants, fungi, and viruses. By
measuring these changes, scientists can then create phylogenetic trees representing a species
that evolved or diverged from another long ago.
The molecular clock hypothesis states that DNA and protein sequences evolve at a rate that is
relatively constant over time and among different organisms. The molecular clock is a
figurative term for a technique that uses the mutation rate of biomolecules to deduce the
time in prehistory when two or more life forms diverged. The biomolecular data used for such
calculations are usually nucleotide sequences for DNA, RNA, or amino acid sequences
for proteins. The benchmarks for determining the mutation rate are often fossil or
archaeological dates. The molecular clock was first tested in 1962 on the hemoglobin protein
variants of various animals and is commonly used in molecular evolution to estimate times
of speciation or radiation. It is sometimes called a gene clock or an evolutionary clock, which
underlines its importance in phylogenetics. The notion of the existence of a so-called "molecular
clock" was first attributed to Émile Zuckerkandl and Linus Pauling who, in 1962, noticed that
the number of amino acid differences in hemoglobin between different lineages changes
roughly linearly with time, as estimated from fossil evidence. They generalized this
observation to assert that the rate of evolutionary change of any specified protein was
approximately constant over time and over different lineages (known as the molecular clock
hypothesis).
When an organism inherits genetic material from the previous generation, changes accumulate
steadily, and these changes are said to be neutral. They are neither disadvantageous nor
advantageous, meaning they do not affect natural selection or fitness but instead accumulate
through genetic drift. Different genes containing different nucleotide substitutions are studied to
determine the rate at which the sequences of the genes have been evolving. This occurrence
happens over a timeframe of millions of years as the genes are passed down and altered from
one generation to the next.
The molecular clock measures the number of random mutations of an organism's gene (DNA
or protein sequences) at a relatively constant rate over a specific timeframe. It is calibrated with
fossil records and geological timescales. It measures how long-ago different organisms were
on Earth and when the divergence of a new species (animal, plant, virus, fungi) happened.
Forensics: Phylogenetics is used to assess DNA evidence presented in court cases to inform
situations, e.g., where someone has committed a crime, when food is contaminated, or where
the father of a child is unknown.
Bioinformatics and computing: Many of the algorithms developed for phylogenetics have
been used to develop software in other fields.
The diagram is known as a phylogenetic tree. Phylogenetic analysis is important for gathering
information on biological diversity, genetic classifications, as well as learning developmental
events that occur during evolution.
With advancements in genetic sequencing techniques, phylogenetic analysis now involves the
sequence of a gene to understand the evolutionary relationships among species. DNA being the
hereditary material can now be sequenced easily, rapidly, and cost-effectively, and the data obtained
from genetic sequencing is very informative and specific.
Also, morphological estimates can be used to infer evolutionary developments, especially in cases
where genetic material is not available (fossils).
When two sequences found in two organisms are very similar, we assume that they are derived
from a common ancestor.
The sequence alignment reveals which positions are conserved from the ancestral sequence.
The progressive multiple alignment of a group of sequences, first aligns the most similar
pair.
Then it adds the more distant pairs.
The alignment is influenced by the “most similar” pairs and arranged accordingly, but it
does not always correctly represent the evolutionary history of the occurred changes.
Not all phylogenetic methods work this way.
Most phylogenetic methods assume that each position in a sequence can change
independently from the other positions.
Gaps in alignments represent mutations in sequences such as: insertion, deletion, genetic
rearrangements
Gaps are treated in various ways by the phylogenetic methods. Most of them ignore gaps.
Another approach to treat gaps is by using sequences similarity scores as the base for the
phylogenetic analysis, instead of using the alignment itself, and trying to decide what
happened at each position.
The similarity scores based on scoring matrices (with gaps scores) are used by the
DISTANCE methods.
Biologists analyze different characteristics of organisms using different analytical tools such
as parsimony, distance, likelihood and Bayesian methods. They consider many
characteristics of organisms including morphological, anatomical, behavioral, biochemical,
molecular and fossil characteristics to construct phylogenetic trees.
A rooted phylogenetic tree is a type of phylogenetic tree that describes the ancestry of a
group of organisms. Importantly, it is a directed tree, starting from a unique node known as
the most recent common ancestor. Basically, the root of the phylogenetic tree represents this most
recent common ancestor.
Figure 1: A Rooted Phylogenetic Tree
However, this recent common ancestor is an extra and distantly related organism to the group
of organisms used to build up the phylogenetic tree. But it serves as the parent of all organisms
in the group.
However, the same data in the rooted phylogenetic tree can be used to generate an unrooted
phylogenetic tree as well. It is created by omitting the root.
The simplest clustering method is UPGMA, which builds a tree by a sequential clustering
method. Given a distance matrix, it starts by grouping two taxa with the smallest pairwise
distance in the distance matrix. A node is placed at the midpoint or half distance between them.
It then creates a reduced matrix by treating the new cluster as a single taxon. The distances
between this new composite taxon and all remaining taxa are calculated to create a reduced
matrix. The same grouping process is repeated and another newly reduced matrix is created.
The iteration continues until all taxa are placed on the tree. The last taxon added is considered
the outgroup producing a rooted tree.
The basic assumption of the UPGMA method is that all taxa evolve at a constant rate and that
they are equally distant from the root, implying that a molecular clock is in effect. However,
real data rarely meet this assumption. Thus, UPGMA often produces erroneous tree topologies.
However, owing to its fast speed of calculation, it has found extensive usage in clustering
analysis of DNA microarray data.
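A compact sketch of the UPGMA procedure described above, applied to a small hypothetical distance matrix (not one taken from the text): the two closest clusters are merged, the new node is placed at half their distance, and distances to the new cluster are computed as size-weighted averages.

```python
# Compact UPGMA sketch: merge the closest pair, place the node at the midpoint,
# and update distances as size-weighted (arithmetic) averages.
def upgma(labels, dist):
    clusters = {lab: 1 for lab in labels}                      # cluster -> leaf count
    d = {(a, b): dist[a][b] for a in labels for b in labels if a != b}
    while len(clusters) > 1:
        a, b = min((p for p in d if p[0] in clusters and p[1] in clusters),
                   key=lambda p: d[p])
        height = d[(a, b)] / 2                                  # node placed at midpoint
        new = f"({a},{b}):{height:g}"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for c in list(clusters):
            avg = (d[(a, c)] * size_a + d[(b, c)] * size_b) / (size_a + size_b)
            d[(new, c)] = d[(c, new)] = avg
        clusters[new] = size_a + size_b
    return next(iter(clusters))                                 # Newick-like string

# Hypothetical symmetric distance matrix for four taxa (illustrative only).
labels = ["A", "B", "C", "D"]
dist = {
    "A": {"A": 0, "B": 4, "C": 8, "D": 8},
    "B": {"A": 4, "B": 0, "C": 8, "D": 8},
    "C": {"A": 8, "B": 8, "C": 0, "D": 4},
    "D": {"A": 8, "B": 8, "C": 4, "D": 0},
}
print(upgma(labels, dist))
```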
The algorithm is based on optimality criteria that select the tree with a minimum amount of
residual (difference between actual and expected summed evolutionary distance). The
algorithm estimates the total branch length (distance) and clusters in accordance with taxa pair
in order to determine the unrooted tree with minimum distance.
1. The FM algorithm does not assume a constant rate of evolution, which is quite realistic.
2. Optimized tree can be selected out of several possible trees.
3. Its demerit is underestimation of very long evolutionary distances, because it ignores
homoplasies (absence of a character in the common ancestor, though it is being shared
by a group of related species originating from the common ancestor).
4. It ignores the role of intermediate ancestor(s); hence, consistency of evolution is not the
basic assumption.
5. Outgroup is added to the sequences in order to generate a rooted tree using the FM
method.
It needs to be emphasized that the resulting tree is an approximate tree and does not have the
rigor of a formally constructed phylogenetic tree. Nonetheless, the tree can be used as a guide
for directing realignment of the sequences. For that reason, it is often referred to as a guide
tree. According to the guide tree, the two most closely related sequences are first re-aligned
using the Needleman–Wunsch algorithm. To align additional sequences, the two already
aligned sequences are converted to a consensus sequence with gap positions fixed. The
consensus is then treated as a single sequence in the subsequent step.
In the next step, the next closest sequence based on the guide tree is aligned with the consensus
sequence using dynamic programming. More distant sequences or sequence profiles are
subsequently added one at a time in accordance with their relative positions on the guide tree.
After realignment with a new sequence using dynamic programming, a new consensus is
derived, which is then used for the next round of alignment. The process is repeated until all
the sequences are aligned as shown below.
Figure. Schematic of a typical progressive alignment procedure (e.g., Clustal). Angled wavy lines represent
consensus sequences for sequence pairs A/B and C/D. Curved wavy lines represent a consensus for A/B/C/D.
Probably the most well-known progressive alignment program is Clustal. Some of its important
features are introduced next. Clustal (www.ebi.ac.uk/clustalw/) is a progressive multiple
alignment program available either as a stand-alone or on-line program. The stand-alone
program, which runs on UNIX and Macintosh, has two variants, ClustalW and ClustalX. The
W version provides a simple text-based interface, and the X version provides a more user-
friendly graphical interface.
Steps:
1. Start with the most similar sequence.
2. Align the new sequence to each of the previous sequences.
3. Create a distance matrix/function for each sequence pair.
4. Create a phylogenetic “guide tree” from the matrices, placing the sequences at
the terminal nodes.
5. Use the guide tree to determine the next sequence to be added to the alignment.
6. Preserve gaps.
7. Go back to step 1.
Progressive MSA is one of the fastest approaches, considerably faster than the adaptation
of pair-wise alignments to multiple sequences, which can become a very slow process for more
than a few sequences.
One major disadvantage, however, is the reliance on a good alignment of the first two
sequences. Errors there can propagate throughout the rest of the MSA. An alternative approach
is iterative MSA (see below).
Distance-matrix methods
Distance-matrix methods are rapid approaches that measure the genetic distances
between sequences. After having aligned the sequences through multiple sequence alignment,
the proportion of mismatched positions is calculated. From this, a matrix is constructed that
describes the genetic distance between each sequence pair. In the resulting phylogenetic tree,
closely related sequences are found under the same interior node, and the branch lengths
represent the observed genetic distances between sequences.
Neighbor joining (NJ), and Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
are two distance-matrix methods. NJ and UPGMA produce unrooted and rooted trees,
respectively. NJ is a bottom-up clustering algorithm. The main advantages of NJ are its rapid
speed, regardless of dataset size, and that it does not assume an equal rate of evolution amongst
all lineages. Despite this, NJ only generates one phylogenetic tree, even when there are several
possibilities. UPGMA is an unweighted method, meaning that each genetic distance
contributes equally to the overall average. UPGMA is not a particularly popular method as it
makes various assumptions, including the assumption that the evolutionary rate is equal for all
lineages.
Maximum parsimony
Maximum parsimony attempts to reduce branch length by minimizing the number of
evolutionary changes required between sequences. The optimal tree would be the shortest tree
with the fewest mutations. All potential trees are evaluated, and the tree with the least amount
of homoplasy, or convergent evolution, is selected as the most likely tree. Since the most-
parsimonious tree is always the shortest tree, it may not necessarily best represent the
evolutionary changes that have occurred. Also, maximum parsimony is not statistically
consistent, leading to issues when drawing conclusions.
Maximum likelihood
Despite being slow and computationally expensive, maximum likelihood is the most widely used
phylogenetic method in research papers, and it is ideal for phylogeny construction from
sequence data. For each nucleotide position in a sequence, the maximum likelihood algorithm
estimates the probability of that position being a particular nucleotide, based on whether the
ancestral sequences possessed that specific nucleotide. The cumulative probabilities for the
entire sequence are calculated for both branches of a bifurcating tree. The likelihood of the
whole tree is provided by the sum of the probabilities of both branches of the tree. Maximum
likelihood is based on the concept that each nucleotide site evolves independently, enabling
phylogenetic relationships to be analyzed at each site. The maximum likelihood method can be
carried out in a reasonable amount of time for four sequences. If more than four sequences are
to be analyzed, then basic trees are constructed for the initial four sequences, and further
sequences are subsequently added, and maximum likelihood is recalculated. Bias can be
introduced into the calculation as the order in which the sequences are added, and the initial
sequence used, play pivotal roles in the outcome of the tree. Bias can be avoided by repeating
the entire process multiple times at random so that the majority rule consensus tree can be
selected.
Bayesian inference
Bayesian inference methods infer phylogeny using posterior probabilities of phylogenetic
trees. A posterior probability is generated for each tree by combining its prior probability with
the likelihood of the data. A phylogeny is best represented by the tree with the highest posterior
probability. Not only does Bayesian inference produce results that can be easily interpreted; it
can also incorporate prior information and complex models of evolution into the analysis, as
well as accounting for phylogenetic uncertainty.
14. Produce a phylogenetic tree using UPGMA for the given matrix.
Unit 5
Protein identification based on composition, Physical properties based on sequence, Motif and
pattern, Secondary structure (Statistical method: Chou Fasman and GOR method, Neural
Network and Nearest neighbor method) and folding classes, specialized structure or features,
Tertiary structures (Homology Modeling); Structure visualization methods (RASMOL,
CHIME etc.); Protein Structure alignment and analysis. Application of bioinformatics in drug
discovery and drug designing.
There are three major theoretical methods for predicting the structure of proteins: comparative
modelling, fold recognition, and ab initio prediction.
Comparative modelling
Comparative modelling exploits the fact that evolutionarily related proteins with similar
sequences, as measured by the percentage of identical residues at each position based on an
optimal structural superposition, have similar structures. The similarity of structures is very
high in the so-called ``core regions'', which typically are comprised of a framework of
secondary structure elements such as alpha-helices and beta-sheets. Loop regions connect these
secondary structures and generally vary even in pairs of homologous structures with a high
degree of sequence similarity.
Ab initio prediction
The ab initio approach is a mixture of science and engineering. The science is in understanding
how the three-dimensional structure of proteins is attained. The engineering portion is in
deducing the three-dimensional structure given the sequence. The biggest challenge in the
folding problem lies in ab initio prediction, which can be broken down
into two components: devising a scoring function that can distinguish correct (native
or native-like) structures from incorrect (non-native) ones, and a search method to explore the
conformational space. In many ab initio methods, the two components are coupled together
such that a search function drives, and is driven by, the scoring function to find native-like
structures.
Currently there does not exist a reliable and general scoring function that can always drive a
search to a native fold, and there is no reliable and general search method that can sample the
conformational space adequately to guarantee a significant fraction of near-native structures (< 3.0
angstroms RMSD from the experimental structure).
Some methods for ab initio prediction include Molecular Dynamics (MD) simulations of
proteins and protein-substrate complexes provide a detailed and dynamic picture of the nature
of inter-atomic interactions with regards to protein structure and function; Monte Carlo (MC)
simulations that do not use forces but rather compare energies, via the use of Boltzmann
probabilities; Genetic Algorithms, which try to improve on the sampling and the convergence
of MC approaches, and exhaustive and semi-exhaustive lattice-based studies which are based
on using a crude/approximate fold representation (such as two residues per lattice point) and
then exploring all or large amounts of conformational space given the crude representation.
Bioinformatics has made it possible to sequence the genomes of various organisms, and there
are almost a hundred organisms whose genomes have been mapped so far. The databases for these
organisms are increasing in size day by day because new information is added every day.
Just as other fields of science have made use of bioinformatics, the pharmaceutical industry has
also adopted bioinformatics and genomics for drug targeting and drug discovery.
The understanding of molecular biology has made it possible to design and develop drugs.
In the beginning, synthetic organic molecules were tested on whole animals or on organ
preparations. In recent years, bioinformatics has made it easy for researchers to target
molecules in an in vitro environment. Newly developed compounds can now be screened against
protein targets or genetically modified cells, giving very efficient results. This way of drug
development has also eased the identification of disease targets in an organism.
Today the pharmaceutical industry is capable of developing drugs that target around 500 gene
products. A gene product is the biochemical material, either RNA or protein, produced from
gene expression. Scientists have now mapped the whole human genome, which consists of roughly
30,000 to 40,000 genes, allowing them to discover more drugs by observing gene expression.
Mapping of the whole human genome has given scientists more potential drug targets, but it
remains a big challenge to find a target that will show successful results. Bioinformatics
focuses the attention of scientists on target validation rather than on target identification alone.
There are many factors that should be considered during drug targeting, such as nucleotide and
protein sequences, functional prediction, mapping information, and gene and protein
expression data. Bioinformatics tools help in collecting information about all these factors
and accumulate it in the form of databases. These databases save the time, money and effort of
researchers and organize the information into groups and subgroups. The information in the
databases also gives knowledge of the molecules involved.
Pharmaceutical industries develop drugs in two broad classes: broad-spectrum drugs and
narrow-spectrum drugs. Broad-spectrum drugs can be used as antibiotics to kill pathogenic
bacteria, ideally without harming human health. Bioinformatics also contributes to drug
targeting through DNA and protein chips, which are very helpful in determining the
expression of DNA or proteins in cells.
The clustering algorithms present in bioinformatics tools help the researchers to make
comparison of the gene expression data and distinguish the diseased cells from healthy ones.
These algorithms also enable the researchers to observe the function of the gene or protein
during the process of disease.
Bioinformatics is a very helpful field because it provides tools that give full information
about proteins and gene families, which researchers can use for applications such as drug
targeting and drug discovery.
Homology modeling (comparative modeling) is one of the computational structure prediction methods
that are used to determine 3D structure of a protein from its amino acid sequence based on its template.
Homology modeling rests on two major observations. First, a protein's 3D structure is principally
determined by its amino acid sequence. Second, structure is more conserved than sequence and
changes at a much slower rate during evolution. As a result, similar sequences fold into essentially
identical structures, and even distantly related sequences adopt similar structures.
Homology modeling is the most accurate of the computational structure prediction methods [7]. 3D
structure predictions made by computational methods like de novo prediction and threading were
compared to homology modeling using Root Mean Square Deviation (RMSD) as a criterion. Homology
modeling was found to give 3D structures with the highest accuracy. Furthermore, it is a protein 3D
structure prediction method that needs less time and lower cost and has clear steps. Thus, homology
modeling is widely used to generate high-quality 3D structures of proteins, and it has transformed
structure-based docking and virtual screening methods in the drug discovery process.
7. What is the best protein structure prediction method according to you? Justify
your answer in comparison to other methods.
Answer: AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D
structure from its amino acid sequence. It regularly achieves accuracy competitive with
experiment.
AlphaFold was the top-ranked protein structure prediction method by a large margin, producing
predictions with high accuracy. While the system still has some limitations, the CASP results
suggest AlphaFold has immediate potential to help us understand the structure of proteins and
advance biological research.
AF2 starts by employing multiple sequence alignments (MSA) with different regions weighted
by importance (Attention). It then uses the Evoformer module to extract information about
interrelationships between protein sequences and template structures. The structure module
treats the protein as a residue gas moved around by the network to generate the protein’s 3D
structure followed by local refinement to provide the final prediction. Assessing the quality of
predictions using the TM-score (a structural similarity metric with values [0,1] whose random
value is 0.3 and values greater than 0.4 indicate fold similarity), AF2 often produced models
with a TM-score greater than 0.9. Going beyond a TM-score of 0.85 indicates that both the
global fold and details are correct. In contrast, MSA-based methods generate an average fold
and plateau around a TM-score of 0.85. A key reason AF2 works is that the library of single-domain protein structures is essentially complete.
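For reference, the TM-score mentioned above can be sketched as follows. This minimal Python example assumes that the per-residue distances from an optimal superposition are already available (tools such as TM-align compute the superposition themselves).

# Minimal sketch: TM-score for a given set of aligned residue-pair distances
# d_i (in Angstroms); the superposition itself is assumed to have been done.
import math

def tm_score(distances, target_length):
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# Toy example: 100-residue target with all aligned residues within 1-2 Angstroms
print(round(tm_score([1.0] * 95 + [2.0] * 5, 100), 3))  # ~0.92, i.e. same fold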
8. What are the major steps involved in the drug discovery process following identification of the biological target? Explain.
Answer: The complexity of drug development has increased manifold over the past 40 years, requiring a preclinical phase of drug development, an investigational new drug (IND) application, and complete clinical testing before marketing approval from the FDA. Generally, new drug applications (NDAs) or biologics license applications (BLAs) are reviewed comprehensively before approval, and drug performance data are subsequently submitted to regulatory agencies through post-marketing studies. The overarching goal is to bring more efficient and safer treatments to patients as quickly as possible after a thorough medical evaluation.
Phases And Stages
There are five critical steps in the U.S. FDA drug development process, including many phases
and stages within each of them. We will discuss each phase and different stages of drug
development to develop an in-depth understanding of the entire process. The phases of drug
development are –
Step 1: Discovery and Development
Step 2: Preclinical Research
Step 3: Clinical Development
Step 4: FDA Review
Step 5: FDA Post-Market Safety Monitoring.
Step 1: Discovery & Development
Drug discovery research is how new medications are discovered. Historically, drug discovery,
design, and development mostly started with identifying active ingredients from traditional
medicines or purely by chance. Later, classical pharmacology was used to investigate chemical
libraries including small molecules, natural products, or plant extracts, and find those with
therapeutic effects. Since the human genome was sequenced, reverse pharmacology has been used to find remedies for existing diseases through modern screening. Insights into disease processes, tests of molecular compounds, existing treatments with unanticipated effects, and new technologies all drive the drug discovery cycle. Today the steps in drug discovery and development involve screening hits, iterative medicinal chemistry, and optimization of hits to reduce potential drug side effects (by increasing affinity and selectivity). Efficacy or potency, metabolic stability (half-life), and oral bioavailability are also improved in these steps of the drug development process.
Hit To Lead
In the Hit to Lead (H2L) process, small molecule hits from an HTS are evaluated and optimized
in a limited way into lead compounds. These compounds then move on to the lead optimization
process.
Lead Optimization
In the lead optimization (LO) process, the lead compounds discovered in the H2L process are
synthesized and modified to improve potency and reduce side effects. Lead optimization
conducts experimental testing using animal efficacy models and ADMET tools, designing the
drug candidate.
In Silico Assays
In silico assays are test systems or biological experiments performed on a computer or via
computer simulation. These are expected to become increasingly popular with the ongoing
improvements in computational power, and behavioral understanding of molecular dynamics
and cell biology.
Drug Delivery
New drug delivery methods include oral, topical, membrane, intravenous, and inhalation. Drug
delivery systems are used for targeted delivery or controlled release of new drugs.
Physiological barriers in animal or human bodies may prevent drugs from reaching the targeted
area or releasing when they should. The goal is to prevent the drug from interacting with
healthy tissues while still being effective.
Oral: Oral delivery of medications is reliable, cost-effective, and convenient for patients. Oral drug delivery may not deliver precise dosages to the desired area, but it is ideal for prophylactic vaccinations and nutritional regimens. Drawbacks include delayed action, destruction by stomach enzymes, inconsistent absorption, and problems for patients with gastrointestinal issues; patients must also be conscious during administration.
Topical: Topical drug delivery involves ointments, creams, lotions, or transdermal patches that deliver a drug by absorption into the body. Topical delivery is most useful for skin or muscular conditions and is preferred by patients because it is non-invasive and can be self-administered.
Parenteral (IM, SC or IP Membrane): Parenteral drug delivery utilizes bodily membranes, including the intramuscular (IM), intraperitoneal (IP), or subcutaneous (SC) routes. It is often used for unconscious patients and avoids epithelial barriers that are difficult for drugs to cross.
Parenteral (Intravenous): Intravenous injection is one of the fastest drug delivery
absorption methods. IV injection ensures entire doses of drugs enter the bloodstream,
and it is more effective than IM, SC, or IP membrane methods.
Parenteral (Inhalation): Inhalation drug delivery allows the drug to be rapidly absorbed through the mucosa of the lungs, nasal passages, throat, or mouth. Problems with inhalation delivery
include difficulty delivering the optimum dosage due to small mucosal surface areas
and patient discomfort. Pulmonary inhalation drug delivery uses fine drug powders or
macromolecular drug solutions. Lung fluids resemble blood, so they can absorb small
particles easily and deliver them into the bloodstream.
Clinical Trials– Dose Escalation, Single Ascending & Multiple Dose Studies
Proper dosing determines medication effectiveness, and clinical trials examine dose escalation, single ascending, and multiple dose studies to determine the best patient dosage.
Phase I – Healthy Volunteer Study
This phase is the first time the drug is tested on humans; less than 100 volunteers will help
researchers assess the safety and pharmacokinetics, absorption, metabolic, and elimination
effects on the body, as well as any side effects for safe dosage ranges.
Pharmacokinetic Analysis
Pharmacokinetic analysis is an experimental study that characterizes how a new drug behaves in the human body. The volume of distribution, clearance, and terminal half-life are estimated through compartmental modeling.
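The quantities named above are related by simple formulas in the simplest case, a one-compartment model with an intravenous bolus dose. The following Python sketch uses illustrative parameter values, not data from any real study.

# Minimal sketch: one-compartment IV-bolus pharmacokinetics with
# illustrative parameter values.
import math

dose_mg = 100.0            # administered dose
volume_l = 40.0            # volume of distribution (V)
clearance_l_per_h = 5.0    # clearance (CL)

k_elim = clearance_l_per_h / volume_l    # elimination rate constant
half_life_h = math.log(2) / k_elim       # terminal half-life

def concentration(t_hours):
    # Plasma concentration (mg/L) at time t after the bolus
    return (dose_mg / volume_l) * math.exp(-k_elim * t_hours)

print(f"t1/2 = {half_life_h:.1f} h, C(2 h) = {concentration(2):.2f} mg/L")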
Blood, Plasma, Urine & Feces Sample Analysis for Drug and Metabolites
Biological samples used in clinical trials include blood, plasma, urine, and feces to determine
and analyze various properties and effects of the drug and its metabolites on humans.
Patient Protection – GCP, HIPAA, & Adverse Event Reporting
Human patients must always be protected during clinical trials; Good Clinical Practice (GCP), the Health Insurance Portability and Accountability Act (HIPAA), and adverse event reporting to the IEC/IRB regulate and ensure their safety.
IND Application
IND applications are submitted to the FDA before starting clinical trials. If clinical trials are
ready to be conducted, and the FDA has not responded negatively about the drug, developers
may start the trials.
Orphan Drug
An orphan drug is intended to treat a disease so rare that financial sponsors are unwilling to develop the drug under standard marketing conditions. Such drugs may not be approved quickly, or at all.
Accelerated Approval
New drugs may be granted accelerated approval if there is strong evidence of positive impact
on a surrogate endpoint instead of evidence of impact on actual clinical benefits the drug
provides. Expedited approval allows medications for severe or life-threatening conditions to reach patients sooner.
The percentage of sequence identity also affects the quality of the final model and, therefore,
of the studies you can carry out with the model.
Sequence identity    Model quality
60-100%              Comparable with average-resolution NMR; suitable for substrate-specificity studies
30-60%               Starting point for site-directed mutagenesis studies
< 30%                Serious errors
This may sound simple, but a great many factors have to be taken into account. For example, researchers must consider whether the drug will bind to a single target molecule or to more than one; binding to other molecules might bring about more adverse effects. How well the drug binds to the target molecule also needs to be clarified: if it cannot bind to the target well, it may be useless.
The shape and charge of the molecule must be examined in order to determine how this might
affect the way it not only binds to the target, but is absorbed, distributed, metabolized, and
excreted by the body.
The general process of rational drug design is as follows: scientists must first identify a target
for a certain therapeutic need. Next, the structure of the target protein must be defined. Lastly,
the drug’s structure must be designed so that it specifically interacts with the target protein. If
a biological target cannot be identified, an alternative route is taken. Here, we will discuss all
three major rational drug design approaches, including structure-based (as briefly described
above), pharmacophore-based (an alternative approach used when the biologic target cannot
be defined), and new lead generation.
Structure-based approaches
The first port of call in rational drug design is establishing the three-dimensional structure of the known target (often a protein or a nucleic acid). The process of identifying this structure usually involves producing a working computational model from crystallographic data. However, in recent years, the use of active ligands to develop models of the active binding site has become more common. From this, scientists work on designing a drug that has the correct
structure required to specifically bind to the target site and induce the biological activity
required for therapeutic effect.
Pharmacophore-Based Approaches
In cases where scientists cannot identify a target, or they cannot determine the structure of a
target, an alternative strategy is used. This approach leverages data on drugs that produce the
same intended effect as the one that is being developed. Generally, it is assumed that drugs that
induce the same biological activity are acting on the same pathways and likely the same
biological target.
Therefore, it is also assumed that to trigger these same biological responses, the new drug must
have the same set of structural features to allow it to successfully interact with the target. This
method, based on the use of analogs to develop a model of the requirements for inducing therapeutic activity, is known as the pharmacophore model.
In the pharmacophore model, scientists first determine the three-dimensional features that are
critical for biological activity. Next, the optimal combinations for the structural features are
calculated. Finally, scientists design a drug structure that encompasses this optimal
combination of features.
Pharmacophore approaches have benefited in recent years from advances in the fields of molecular biology, protein crystallography, and computational chemistry, which have improved the accuracy of binding-affinity predictions in drug design.
In the future, the practicality of in silico drug design will grow along with the exponential growth of computing technology, since the process is grounded in computation and data storage. This first became apparent in 1990, as processing power and data storage capacity increased, allowing the entirety of the human genome to be sequenced.
In silico drug design and modeling now pose a very attractive alternative to human and animal testing. It has even been hypothesized that such models could render testing on living organisms completely obsolete. Going one step further, virtual patients may become conventional, replacing actual patients or cadavers within the medical school curriculum.
Complex-modeling software (e.g., LEGEND, CrystalDock, GANDI) is used to analyze molecular interactions of interest, identifying potential binding fragments for future drugs. These virtual fragments are linked to previously known docking and protein binding sites (receptors). Using a combination of genetic algorithms and chemical/virtual screening, new drugs can be developed at a much faster rate.
The first case of vaccine design from genomic information appeared in 2003, a technique christened reverse vaccinology. Using genomic information and sequencing of microbial genomes, potential antigens can be discovered. The elucidation of each antigen, and of each corresponding pathogen, has led to a revitalization of vaccine development.
An example can be found in the work of Alessandro Sette's team of immunologists in developing a vaccine against serogroup B meningococcus. Computational screening of vaccination targets revealed CD8+ and CD4+ T cell epitopes, which led to a clearer understanding of the immunity induced by vaccination and, in turn, to a better understanding of the nature of T cells.
The process of Ligand Design Using in Silico Methodologies
These ligands are first prepared and modeled through computational means, acquiring previously
derived data to obtain a two-dimensional or three-dimensional structure of the parent molecule, and
then relating this data to other ligand interactions. This technique is performed by using prior models
of biological systems to predict the molecular dynamics in a medicinal context.
The next step in ligand design is target prediction. This process involves predicting drug targets through computational extrapolation tools. Through homology modeling and the digital assaying of biochemical pathways, the active site and binding site of different drugs can often be found without the use of any wet chemistry or laboratory work-up. By circumventing this process, both medical and research centers have saved time and resources on the path to drug discovery.
The third and final step in drug discovery, just before trials, is in silico bioactivity analysis. This is accomplished through chemical screening and/or virtual screening.
Chemical screening is performed on the lab bench after digital assays narrow down a large number of
chemical compounds to be tested for a specified biological function.
In this form of assay, a large number of compounds is tested against biological targets such as channel proteins, hormone receptors, and others. The biological function of these ligands is often studied in test tubes. For example, a fixed amount of target protein is kept in test tubes in an appropriate buffer medium. Different compounds selected from a predetermined library are then added to the test tubes at a fixed concentration, and the inhibitory activity of each compound is measured using an appropriate test. Compounds showing inhibitory activity are taken as hits and are selected for a further round of assays to characterize their inhibitory activity and identify the most potent inhibitor. This is generally done with the help of robotics.
In virtual screening, the binding affinity of compounds from a data library is calculated against an in silico biological target using molecular docking programs. These computational results can later be confirmed through experimental procedures.
Within the field of biochemistry, it is often repeated that structure equates to function, and
understanding the structure of ligands, protein targets, and other molecular machines through in silico
methods can aid in the discovery of drug-resistant mechanisms. It will also evolve precision medicine
and can be used to understand the underlying causes of many inherited diseases.
Determining the allostery of these molecules can help us identify alternative drug targeting sites and
advance our foundations of drug design as a whole.
12. Explain the Chou-Fasman method for protein secondary structure analysis.
Answer:
The Chou-Fasman method was among the first secondary structure prediction algorithms developed
and relies predominantly on probability parameters determined from relative frequencies of each amino
acid's appearance in each type of secondary structure (Chou and Fasman, 1974). The original Chou-
Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce
poor results compared to modern methods, though the parameterization has been updated since it was
first published. The Chou-Fasman method is roughly 56-60% accurate in predicting secondary
structures. The method is based on analyses of the relative frequencies of each amino acid in alpha
helices, beta sheets, and turns based on known protein structures solved with X-ray crystallography.
From these frequencies a set of probability parameters were derived for the appearance of each amino
acid in each secondary structure type, and these parameters are used to predict the probability that a
given sequence of amino acids would form a helix, a beta strand, or a turn in a protein. The method is
at most about 50–60% accurate in identifying correct secondary structures, which is significantly less
accurate than the modern machine learning–based techniques.
The Chou–Fasman method predicts helices and strands in a similar fashion, first searching linearly through the sequence for a "nucleation" region of high helix or strand probability and then extending the region until a subsequent four-residue window carries an average probability of less than 1.00. As originally described, four out of any six contiguous amino acids were sufficient to nucleate a helix, and three out of any five contiguous amino acids were sufficient for a sheet. The probability thresholds for helix and strand nucleation are constant but not necessarily equal; originally 1.03 was set as the helix cutoff and 1.00 for the strand cutoff.
Turns are also evaluated in four-residue windows but are calculated using a multi-step procedure
because many turn regions contain amino acids that could also appear in helix or sheet regions. Four-
residue turns also have their own characteristic amino acids; proline and glycine are both common in
turns. A turn is predicted only if the turn probability is greater than the helix or sheet probabilities and a
probability value based on the positions of particular amino acids in the turn exceeds a predetermined
threshold. The turn probability p(t) is determined as:
p(t) = pt(j) × pt(j+1) × pt(j+2) × pt(j+3)
where j is the position of the amino acid in the four-residue window. If p(t) exceeds an arbitrary cutoff
value (originally 7.5e–3), the mean of the p(j)'s exceeds 1, and p(t) exceeds the alpha helix and beta
sheet probabilities for that window, then a turn is predicted. If the first two conditions are met but the
probability of a beta sheet p(b) exceeds p(t), then a sheet is predicted instead.
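A minimal Python sketch of the helix nucleation rule described above follows. The propensity values are a small illustrative subset rather than the full published Chou-Fasman table, and the extension step and conflict resolution with sheet and turn predictions are omitted.

# Minimal sketch of Chou-Fasman-style helix nucleation: slide a 6-residue
# window and flag windows in which at least 4 residues have helix
# propensity >= 1.0. The values below are an illustrative subset only.
HELIX_PROPENSITY = {
    "A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45,  # helix formers
    "G": 0.57, "P": 0.57, "S": 0.77, "N": 0.67,  # helix breakers / indifferent
}

def helix_nucleation_sites(sequence, window=6, required=4):
    sites = []
    for i in range(len(sequence) - window + 1):
        formers = sum(1 for aa in sequence[i:i + window]
                      if HELIX_PROPENSITY.get(aa, 0.0) >= 1.0)
        if formers >= required:
            sites.append(i)
    return sites

print(helix_nucleation_sites("AELMGAPSNGAELM"))  # start positions of nuclei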
13. Describe GOR statistical method for secondary structure prediction.
Answer: The GOR method (short for Garnier–Osguthorpe–Robson) is an information theory-based
method for the prediction of secondary structures in proteins. It was developed in the late 1970s shortly
after the simpler Chou–Fasman method. Like Chou–Fasman, the GOR method is based
on probability parameters derived from empirical studies of known protein tertiary structures solved
by X-ray crystallography. However, unlike Chou–Fasman, the GOR method takes into account not only
the propensities of individual amino acids to form particular secondary structures, but also
the conditional probability of the amino acid to form a secondary structure given that its immediate
neighbors have already formed that structure. The method is therefore essentially Bayesian in its
analysis.
Method: The GOR method analyzes sequences to predict alpha helix, beta sheet, turn, or random
coil secondary structure at each position based on 17-amino-acid sequence windows. The original
description of the method included four scoring matrices of size 17×20, in which the rows correspond to the 17 window positions and the columns to the 20 amino acids; each entry is a log-odds score reflecting the probability of finding the given amino acid at that position in the 17-residue window. The four matrices reflect the probabilities of the central, ninth amino acid being in a helical, sheet, turn, or coil conformation. In subsequent revisions to the method, the turn
matrix was eliminated due to the high variability of sequences in turn regions (particularly over such a
large window). The method is considered to work best when at least four contiguous residues score as alpha helix before a region is classified as helical, and at least two contiguous residues for a beta sheet.
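The window-based scoring just described can be sketched as follows. In this Python example the 17×20 score tables are random placeholders rather than the published GOR parameters, and turn prediction is omitted (as in the later GOR versions).

# Minimal sketch of GOR-style scoring: for each residue, sum the
# position-specific scores of the 17 residues around it for each state
# (H = helix, E = sheet, C = coil) and pick the highest-scoring state.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW, HALF = 17, 8

rng = np.random.default_rng(1)
# One 17 x 20 score matrix per state (placeholder values, not GOR parameters)
matrices = {state: rng.normal(size=(WINDOW, 20)) for state in "HEC"}

def predict(sequence):
    states = []
    for i in range(len(sequence)):
        scores = {}
        for state, m in matrices.items():
            total = 0.0
            for offset in range(-HALF, HALF + 1):
                j = i + offset
                if 0 <= j < len(sequence):
                    total += m[offset + HALF, AA_INDEX[sequence[j]]]
            scores[state] = total
        states.append(max(scores, key=scores.get))
    return "".join(states)

print(predict("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))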
Algorithm: The mathematics and algorithm of the GOR method were based on an earlier series of
studies by Robson and colleagues reported mainly in the Journal of Molecular Biology and The
Biochemical Journal. The latter describes the information theoretic expansions in terms of conditional
information measures. The use of the word "simple" in the title of the GOR paper reflected the fact that
the above earlier methods provided proofs and techniques somewhat daunting by being rather
unfamiliar in protein science in the early 1970s; even Bayes methods were then unfamiliar and
controversial. An important feature of these early studies, which survived in the GOR method, was the
treatment of the sparse protein sequence data of the early 1970s by expected information measures.
That is, expectations on a Bayesian basis considering the distribution of plausible information measure
values given the actual frequencies (numbers of observations). The expectation measures resulting from
integration over this and similar distributions may now be seen as composed of "incomplete" or extended zeta functions, e.g. ζ(s, observed frequency) − ζ(s, expected frequency), where ζ denotes the incomplete zeta function.
One of the major thrusts of current bioinformatics approaches is the prediction and identification of
biologically active candidates, and the mining and storage of related information. It also provides strategies and algorithms to predict new drug targets and to store and manage available drug target information.
In molecular docking: Docking is an automated computational procedure that attempts to find the best match between two molecules; it amounts to a computational determination of the binding mode and binding affinity between the molecules.
This includes determining the orientation of the compound, its conformational geometry, and the
scoring. The scoring may be a binding energy, free energy, or a qualitative numerical measure.
In some way, every docking algorithm automatically tries to put the compound in many different
orientations and conformations in the active site, and then computes a score for each.
Some bioinformatics programs store the data for all of the tested orientations, but most only keep a
number of those with the best scores.
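A schematic Python sketch of this bookkeeping is shown below; the pose generator and the scoring function are random stand-ins, not a real docking engine or force field.

# Schematic of pose handling in docking: generate candidate orientations /
# conformations, score each one, and keep only the best-scoring poses.
import heapq
import random

def generate_poses(n):
    # Stand-in: each "pose" is just an identifier; a real program would
    # produce ligand coordinates placed in the binding site.
    return range(n)

def score(pose):
    # Stand-in for a binding-energy or empirical scoring function
    # (lower = better, as with an estimated binding free energy).
    random.seed(pose)
    return random.uniform(-12.0, 0.0)

def dock(n_poses=10000, keep=10):
    scored = ((score(p), p) for p in generate_poses(n_poses))
    return heapq.nsmallest(keep, scored)   # retain only the top poses

for energy, pose in dock():
    print(f"pose {pose}: score {energy:.2f}")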
Docking can be done using bioinformatics tools which are able to search a database containing
molecular structures and retrieve the molecules that can interact with the query structure.
It also aids in building up chemical and biological information databases about ligands and targets/proteins to identify and optimize novel drugs.
It is involved in devising in silico filters that calculate drug-likeness or pharmacokinetic properties of chemical compounds prior to screening, enabling early detection of compounds that are more likely to fail in clinical stages and enhancing detection of promising entities (a small filter sketch follows this list).
Bioinformatics tools help in the identification of homologs of functional proteins such as motif, protein
families or domains.
It helps in the identification of targets by cross-species comparison using pairwise or multiple alignments.
It allows identifying drug candidates from a large collection of compound libraries by means of virtual
high-throughput screening (VHTS).
Homology modeling is extensively used for active site prediction of candidate drugs.
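As one concrete example of the in silico drug-likeness filters mentioned in the list above, the following Python sketch applies Lipinski's rule of five. It assumes the open-source RDKit toolkit is installed and is only an illustration, not a complete ADMET filter.

# Minimal sketch of a drug-likeness filter (Lipinski's rule of five),
# assuming RDKit is available.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

print(passes_rule_of_five("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True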
“In Silico” is an expression used to mean “performed on computer or via computer simulation.”
In Silico drug designing is thus the identification of the drug target molecule by employing
bioinformatics tools.
The inventive process of finding new medications based on the knowledge of a biological target is called drug design. It can be accomplished in two ways: structure-based design, which uses the 3D structure of the biological target, and ligand-based design, which uses knowledge of molecules already known to bind the target.
Even though most of these processes depend on experimental work, in silico approaches play important roles in every stage of the drug discovery pipeline, as described below:
RasMol reads in molecular co-ordinate files and interactively displays the molecule on
the screen in a variety of representations and colour schemes. Supported input file formats
include Brookhaven Protein Databank (PDB), Tripos Associates' Alchemy and Sybyl Mol2
formats, Molecular Design Limited's (MDL) Mol file format, Minnesota Supercomputer
Centre's (MSC) XYZ (XMol) format and CHARMm format files. If connectivity information
is not contained in the file this is calculated automatically. The loaded molecule can be shown
as wireframe bonds, cylinder 'Dreiding' stick bonds, alpha-carbon trace, space-filling (CPK)
spheres, macromolecular ribbons (either smooth shaded solid ribbons or parallel strands),
hydrogen bonding and dot surface representations. Different parts of the molecule may be
represented and coloured independently of the rest of the molecule or displayed in several
representations simultaneously. The displayed molecule may be rotated, translated, zoomed
and z-clipped (slabbed) interactively using either the mouse, the scroll bars, the command line
or an attached dial box. RasMol can read a prepared list of commands from a 'script' file (or
via inter-process communication) to allow a given image or viewpoint to be restored quickly.
RasMol can also create a script file containing the commands required to regenerate the current
image. Finally, the rendered image may be written out in a variety of formats including either
raster or vector PostScript, GIF, PPM, BMP, PICT, Sun rasterfile or as a MolScript input script
or a Kinemage.
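An illustrative RasMol script of the kind mentioned above is shown below; the PDB file name is a placeholder, and the commands follow the forms documented in the RasMol manual.

# illustrative RasMol script (placeholder file names)
load pdb 1abc.pdb
background white
wireframe off
cartoons              # draw the macromolecular ribbon as solid cartoons
colour structure      # colour by secondary structure
select hetero and not water
spacefill             # show bound ligands as CPK spheres
write gif view.gif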
RasMol has been developed at the University of Edinburgh's Biocomputing Research Unit and
the BioMolecular Structure Department, Glaxo Research and Development, Greenford, U.K.
Substitution matrix
Substitution matrices such as BLOSUM are used for sequence alignment of proteins. A
Substitution matrix assigns a score for aligning any possible pair of residues. In general,
different substitution matrices are tailored to detecting similarities among sequences that are
diverged by differing degrees. A single matrix may be reasonably efficient over a relatively
broad range of evolutionary change. The BLOSUM-62 matrix is one of the best substitution
matrices for detecting weak protein similarities. BLOSUM matrices with high numbers are
designed for comparing closely related sequences, while those with low numbers are designed for comparing distantly related sequences. For example, BLOSUM-80 is used for alignments that
are more similar in sequence, and BLOSUM-45 is used for alignments that have diverged from
each other. For particularly long and weak alignments, the BLOSUM-45 matrix may provide
the best results. Short alignments are more easily detected using a matrix with a higher "relative
entropy" than that of BLOSUM-62. The BLOSUM series does not include any matrices with
relative entropies suitable for the shortest queries.
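A small Python sketch of working with these matrices is given below; it assumes Biopython (version 1.75 or later), whose Bio.Align.substitution_matrices module bundles the BLOSUM tables.

# Minimal sketch: inspecting BLOSUM scores with Biopython's bundled matrices.
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
print(blosum62["W", "W"])   # identical tryptophans score highly (11)
print(blosum62["A", "A"])   # identical alanines score lower (4)

# A higher-numbered matrix such as BLOSUM80 suits more similar sequences,
# a lower-numbered one such as BLOSUM45 suits more diverged sequences.
blosum45 = substitution_matrices.load("BLOSUM45")
print(blosum45["W", "W"])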