
Unit 3 Bioinformatics

Bioinformatics

Uploaded by

Bhavana Manimala
Copyright © All Rights Reserved

MULTIPLE SEQUENCE ALIGNMENT

Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set.

There are 6 methods of Multiple Sequence Alignment:

1. Sum of Pairs (SP) method
2. Progressive method
3. Iterative method
4. Motif finding method
5. DALI method
6. SSAP method

The common procedure for Multiple Sequence Alignment is as follows:

A pairwise alignment is done initially and then step-wise multiple alignment is done.

(A) Pairwise alignment
• 6 pairwise comparisons (for four sequences A, B, C and D), then cluster analysis

(B) Multiple alignment following the tree from (A)
• Align the most similar pair, inserting gaps to optimise the alignment
• Align the next most similar pair
• Align the alignments, preserving gaps; a new gap is inserted to optimise the alignment of (BD) with (AC)

1. SUM OF PAIRS (SP) METHOD
The Sum-of-pairs method is a dynamic programming method. In this method, instead of aligning two sequences at a time with dynamic programming, we need to align 3 or more sequences simultaneously. This technique is applicable to any number of sequences, but it is computationally expensive in both time and memory.
Process
• Standard dynamic programming is first used on all pairs of query sequences.
• Then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions.
• Finally, an alignment is constructed between each two-sequence alignment.
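
The sum-of-pairs objective itself can be illustrated with a short sketch. It only scores an alignment that already exists; it does not search for the best one. The function names and the scoring values (+1 match, -1 mismatch, -2 gap) are illustrative assumptions, not part of any particular program.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (assumed scoring values)."""
    if x == "-" and y == "-":
        return 0  # two gaps facing each other carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Add up the pairwise scores of every column of a gapped alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        # every unordered pair of rows contributes once per column
        for x, y in combinations(column, 2):
            total += pair_score(x, y)
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

Because every pair of rows is scored in every column, the cost of evaluating one alignment already grows with the square of the number of sequences, which hints at why optimising this score exactly is expensive.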

2. PROGRESSIVE METHOD
This is the most commonly used approach to MSA. ClustalW, ClustalX, PileUp and PIMA are progressive alignment programs and are based on DP methods. A slower but more accurate variant of the progressive method is T-Coffee.
ClustalW produces the best match for the selected sequences and arranges them so that the identities, similarities and differences can be seen.
Process
• Perform pair-wise alignments for all sequences.
• Use the alignment scores to build a phylogenetic tree using neighbour-joining methods.
• Align the sequences following the phylogenetic relationships indicated by the tree.

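The first two steps of the progressive process can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring values are arbitrary, and the tree is built by a simple greedy average-linkage clustering rather than by neighbour-joining, which real programs such as ClustalW use. The names `nw_score` and `guide_tree` are hypothetical.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score (Needleman-Wunsch, score only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Return nested tuples giving the order in which sequences are merged.

    seqs maps a name to a sequence. The most similar clusters (highest
    average pairwise score) are merged first.
    """
    pair = {frozenset((x, y)): nw_score(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)

    def average(i, j):
        left, right = clusters[i][1], clusters[j][1]
        total = sum(pair[frozenset((x, y))] for x in left for y in right)
        return total / (len(left) * len(right))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(*ij))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    example = {"A": "GGCTT", "B": "GGCTA", "C": "TTTAAAC", "D": "TTTAAAG"}
    print(guide_tree(example))
```

The nested tuples give the order of the third step: the innermost pairs are aligned first, and the resulting alignments are then aligned to each other.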
3. ITERATIVE METHOD
Iterative methods begin by making an initial alignment of the sequences. These alignments are then revised to give a more reasonable result. The objective of this approach is to improve the overall alignment score.
Dialign, SAGA and BlockMaker are a few programs that use iterative methods for MSAs.

4. MOTIF FINDING METHOD

Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved sequence motifs among the sequences in the query set. The profile matrices are then used to search other sequences for occurrences of the motif they characterize.
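
How a profile matrix can be built from aligned motif instances and then used to search another sequence is sketched below. The pseudocount, the DNA alphabet and the function names are assumptions made for the illustration; real profile tools use more elaborate scoring.

```python
from collections import Counter


def build_profile(instances, alphabet="ACGT", pseudocount=1):
    """Column-wise residue frequencies of equally long motif instances."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("motif instances must have the same length")
    profile = []
    for column in zip(*instances):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({ch: (counts[ch] + pseudocount) / total
                        for ch in alphabet})
    return profile


def best_match(profile, sequence):
    """Return (start, probability) of the best-scoring window, or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        probability = 1.0
        for column, ch in zip(profile, sequence[start:start + width]):
            probability *= column.get(ch, 0.0)
        if best is None or probability > best[1]:
            best = (start, probability)
    return best


if __name__ == "__main__":
    profile = build_profile(["TATA", "TATA", "TACA"])
    print(best_match(profile, "GGTATAGG"))
```

The pseudocount keeps a residue that was never observed in a column from forcing the score of a window to zero.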
5. DALI METHOD

The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.

6. SSAP METHOD

SSAP (sequential structure alignment program) is a dynamic programming-based method. It is used in the construction of the CATH (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds.

APPLICATIONS OF MULTIPLE SEQUENCE ALIGNMENTS

1. MSAs can be used to find the regions of similar sequence in all of the sequences that define a conserved consensus pattern or domain.
2. MSAs are powerful tools for identifying new members of the aligned group.
3. It is possible to query databases of MSAs with single sequences and to query sequence databases with multiple sequences.
4. Design of degenerate PCR primers is another major application for multiple alignments.
5. MSAs are used to study the evolution and the sequence conservation in a group of genes, i.e. phylogenetic analysis.
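
The first application, finding a conserved consensus pattern, can be sketched with a column-wise majority vote over an existing alignment. The threshold, the placeholder symbol for ambiguous columns and the function name are assumptions made for this example.

```python
from collections import Counter


def consensus(alignment, threshold=0.5, ambiguous="X"):
    """Return the consensus string and the indices of fully conserved columns."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    symbols, conserved = [], []
    for index, column in enumerate(zip(*alignment)):
        residue, count = Counter(column).most_common(1)[0]
        # keep the majority residue only if it is frequent enough
        symbols.append(residue if count / len(column) >= threshold else ambiguous)
        if count == len(column) and residue != "-":
            conserved.append(index)
    return "".join(symbols), conserved


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "ACCT"]))
```

Runs of fully conserved columns are the natural candidates for conserved domains, and columns with a small set of alternatives are where degenerate primer positions would be placed.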
PAIR-WISE ALIGNMENT
Pair-wise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pair-wise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query).
The three primary methods of producing pair-wise alignments are
1. Dot-matrix methods,
2. Dynamic programming, and
3. Word methods.
The usefulness of a pair-wise alignment can be quantified by the 'Maximum Unique Match' (MUM), i.e. the longest subsequence that occurs in both the query sequence and the existing sequence. Longer MUM sequences typically reflect closer relatedness.

1. DOT-MATRIX METHODS

This method was introduced by Gibbs & McIntyre in 1970. The dot-matrix approach is a qualitative and simple but time-consuming method. In this method it is easy to visually identify sequence features such as insertions, deletions, repeats, or inverted repeats.

Process:
To construct a dot-matrix plot, the two sequences are written along the X and Y axes of a two-dimensional matrix.
A dot is placed at any point where the characters in the corresponding row and column match.
It is a typical recurrence plot.
For example, consider the following 2 sequences:
GGCTT
GGATTGA

[Figure: dot plot constructed from the sequences GGCTT and GGATTGA]

For closely related sequences, the dots appear as a line along the matrix's main diagonal. The alignment with the greatest number of identities will be the optimal alignment.
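
The construction can be reproduced for the two example sequences with a few lines of code. The choice of `*` for a dot, `.` for an empty cell, and the function name are assumptions made for this sketch.

```python
def dot_plot(seq_x, seq_y, dot="*", empty="."):
    """One text row per character of seq_y, one column per character of seq_x."""
    return ["".join(dot if cy == cx else empty for cx in seq_x)
            for cy in seq_y]


if __name__ == "__main__":
    seq_x, seq_y = "GGATTGA", "GGCTT"
    print("  " + seq_x)
    for label, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        print(label + " " + row)
```

In the printed matrix, the matching G G and T T runs lie on the main diagonal, interrupted at the row of the C, which has no match in the other sequence.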

Uses of Dot Plot

Dot plots are used
1. To visually compare two sequences and detect the regions of close similarity between them.
2. To assess repetitiveness in a single sequence.
3. To visualize multiple similar structural domains.

Problems with dot plots:
1. Lack of clarity.
2. Plots are semi-quantitative.
3. Little statistical analysis is involved.
4. Dot plots are limited to two sequences.

2. DYNAMIC PROGRAMMING METHOD
Dynamic programming is a method for breaking down the alignment of sequences into small parts. It is comparable to moving across a dot matrix and keeping track of all the matching pairs. It involves adding up those pairs that are along a diagonal and subtracting when insertions are necessary to maintain an alignment.

Dynamic programming can provide global (via the Needleman-Wunsch algorithm) or local sequence alignments (via the Smith-Waterman algorithm).
Process:
• All possible alignments of the 2 sequences are considered.
• An alignment matrix is drawn using global or local alignment methods.
• To calculate the score at any point in the alignment matrix, the three adjacent positions that represent the preceding part of the alignment are taken and the following calculation is done:
  - A gap penalty is subtracted for an insertion (e.g. -2).
  - An identity or a mismatch is scored (e.g. +1 or -1).
• The final step is checking for the cell giving the highest score and adding the scores along a limited number of paths. The optimal alignment is the path which gives the highest total score.
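
The matrix calculation described in the process can be sketched for the global (Needleman-Wunsch) case, using the example values from the text (+1 identity, -1 mismatch, -2 gap). The function name and the tie-breaking order in the traceback are assumptions; when several alignments share the best score, only one of them is returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap  # leading gaps in a

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # each cell takes the best of its three adjacent predecessors
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # trace one optimal path back from the bottom-right cell
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

A local (Smith-Waterman) variant would additionally clamp every cell at zero and start the traceback from the highest-scoring cell instead of the bottom-right corner.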

Uses of Dynamic Programming

Dynamic programming is
1. Used to evaluate frameshifts, i.e. insertions or deletions.
2. Useful for sequences containing a large number of indels.

Problems with Dynamic Programming
1. This method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming.
2. It is slow for large numbers of sequences or for extremely long sequences.

3. WORD METHODS

Word methods, also known as k-tuple methods, are heuristic methods that are more efficient than dynamic programming. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family.

FASTA

In the FASTA method, the user defines a value k to use as the word length with which to search the database. The 4 steps involve
1) Find initial regions in search sequence
2) Re-score to find top 10 initial regions
3) Attempt to join initial regions together
4) Optimize around initial region to find best fit
This method is slower but more sensitive.
BLAST
BLAST was developed to provide a faster alternative to FASTA without neglecting accuracy.
Methodology of BLAST involves:
1) The process of finding initial words, called seeding, is done.
2) Words are taken as sets of 3 residues. After making words for the sequence of interest, neighborhood words are also assembled.
3) Once both words and neighborhood words are assembled and compiled, they are compared to the sequences in the database in order to find matches.
4) Scores of alignment are calculated in comparison to a pre-determined score T.

These methods are especially useful in large-scale database searches. Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences.
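
The word-matching idea shared by FASTA and BLAST can be sketched as an exact k-tuple lookup. This is a simplified illustration, not the algorithm of either tool: it slides an overlapping window over both sequences, matches words exactly (no neighborhood words, no threshold T), and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_index(sequence, k):
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def word_hits(query, subject, k):
    """List (query_position, subject_position) pairs of identical words."""
    index = kmer_index(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


def best_diagonal(hits):
    """Return (diagonal offset, number of hits) for the richest diagonal."""
    diagonals = Counter(s - q for q, s in hits)
    return diagonals.most_common(1)[0] if diagonals else None


if __name__ == "__main__":
    hits = word_hits("GGCTT", "GGATTGA", k=2)
    print(hits, best_diagonal(hits))
```

Hits that fall on the same diagonal correspond to an "initial region"; a real search would rescore and extend only the best of these regions instead of filling a full dynamic programming matrix.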
2. Gene Prediction - Importance and Methods

Gene prediction by computational methods, i.e. finding the location of protein coding regions, is one of the essential issues in bioinformatics.
Gene prediction basically means locating genes along a genome. Also called gene finding, it refers to the process of identifying the regions of genomic DNA that encode genes.
This includes protein-coding genes, RNA genes and other functional elements such as the regulatory genes.
Importance of Gene Prediction

• Helps to annotate contiguous sequences.
• Aids in the identification of fundamental and essential elements of the genome such as functional genes, introns, exons, splicing sites, regulatory sites, genes encoding known proteins, motifs, EST, ACR, etc.
• Distinguishes between coding and non-coding regions of a genome.
• Predicts complete exon-intron structures of protein-coding regions.
• Describes individual genes in terms of their function.
• It has vast application in structural genomics, functional genomics, metabolomics, transcriptomics, proteomics, genome studies and other genetics-related studies, including the detection, treatment and prevention of genetic disorders.

Bioinformatics and the Prediction of Genes

With databases of human and model organism DNA sequences increasing quickly with time, it has become almost impossible to carry out the conventional painstaking experimentation on living cells and organisms to predict genes.
Formerly, statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other.
However, today, the frontiers of bioinformatics research are making it increasingly possible to predict the function of such a deluge of genes based on sequence alone.

Methods of Gene Prediction


Two classes of methods are generally adopted:

A. Similarity based searches:

It is a method based on sequence-similarity searches.
It is a conceptually simple approach that is based on finding similarity in gene sequences between ESTs (expressed sequence tags), proteins, or other genomes and the input genome.
This approach is based on the assumption that functional regions (exons) are more conserved evolutionarily than nonfunctional regions (intergenic or intronic regions).
Once there is similarity between a certain genomic region and an EST, DNA, or protein, the similarity information can be used to infer the gene structure or function of that region.
GENE PREDICTION OR GENE FINDING METHODS
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions.
Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other.
Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.
Determining that a sequence is functional should be distinguished from determining the function of the gene or its product.
Predicting the function of a gene and confirming that the gene prediction is accurate still demands in vivo experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.
Gene prediction is one of the key steps in genome annotation, following sequence assembly, the filtering of non-coding regions and repeat masking.
Gene prediction is closely related to the so-called 'target search problem' investigating how DNA-binding proteins (transcription factors) locate specific binding sites within the genome.
Many aspects of structural gene prediction are based on current understanding of underlying biochemical processes in the cell such as gene transcription, translation, protein-protein interactions and regulation processes, which are the subject of active research in the various omics fields such as transcriptomics, proteomics, metabolomics, and more generally structural and functional genomics.
generaly structural and functional
Empirical methods
In empirical (similarity, homology or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of known expressed sequence tags, messenger RNA (mRNA), protein products, and homologous or orthologous sequences.
Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code.
Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact.
Given a sequence, local alignment algorithms such as BLAST, FASTA and Smith-Waterman look for regions of similarity between the target sequence and possible candidate matches.
Matches can be complete or partial, and exact or inexact. The success of this approach is limited by the contents and accuracy of the sequence database.
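
Reverse translation, mentioned above, can be sketched with the standard genetic code. The sketch assumes the standard code table only, the function names are hypothetical, and full enumeration is practical only for short peptides because the number of candidate sequences grows multiplicatively with length.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: amino acid -> every codon that encodes it
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that encode the protein."""
    total = 1
    for aa in protein:
        total *= len(CODONS_FOR[aa])
    return total


def reverse_translate(protein):
    """Enumerate every coding DNA sequence (short peptides only)."""
    choices = [CODONS_FOR[aa] for aa in protein]
    return ["".join(codons) for codons in product(*choices)]


if __name__ == "__main__":
    print(count_coding_sequences("MLS"))
    print(reverse_translate("MF"))
```

Methionine and tryptophan have a single codon each, while leucine, serine and arginine have six, so the size of the candidate family depends strongly on the composition of the protein.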
A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene.
However, to apply this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms, only a subset of all genes in the organism's genome are expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture.
Thus, to collect extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of cell types, which presents further difficulties.
For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.
Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast.
For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes.
It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.
New high-throughput transcriptome sequencing technologies such as RNA-Seq and ChIP-sequencing open opportunities for incorporating additional extrinsic evidence into gene prediction and validation, and allow a structurally rich and more accurate alternative to previous methods of measuring gene expression such as expressed sequence tag or DNA microarray.
Major challenges involved in gene prediction involve dealing with sequencing errors in raw DNA data, dependence on the quality of the sequence assembly, handling short reads, frameshift mutations, overlapping genes and incomplete genes.
In prokaryotes it is essential to consider horizontal gene transfer when searching for gene sequence homology. An additional important factor underused in current gene detection tools is the existence of gene clusters, or operons (which are functioning units of DNA containing a cluster of genes under the control of a single promoter), in both prokaryotes and eukaryotes. Most popular gene detectors treat each gene in isolation, independent of others, which is not biologically accurate.
Ab initio methods
Ab initio gene prediction is an intrinsic method based on gene content and signal detection. Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which the genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes.
These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of the protein-coding sequence itself.
Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.
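
The simplest signal-based search of this kind is an open reading frame (ORF) scan, sketched below. It is only an illustration of "signals" (a start codon followed in-frame by a stop codon) and is far cruder than real gene finders: it scans the forward strand only, ignores introns, and the minimum length and the function name are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) slices of ATG...stop ORFs on the forward strand.

    min_codons counts the codons before the stop codon, including ATG.
    """
    orfs = []
    for frame in range(3):
        start = None
        # step through complete codons of this reading frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGGCTTAAGG"
    for begin, end in find_orfs(dna):
        print(begin, end, dna[begin:end])
```

An ORF found this way is only a putative gene; content statistics or extrinsic evidence would be needed to support it.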
Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models (HMMs), to combine information from a variety of different signal and content measurements.
The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach.
Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs.
The SNAP gene finder is HMM-based like Genscan, and attempts to be more adaptable to different organisms, addressing problems related to using a gene finder on a genome sequence that it was not trained against.
A few recent approaches like mSplicer, CONTRAST, or mGene also use machine learning techniques like support vector machines for successful gene prediction. They build a discriminative model using hidden Markov support vector machines or conditional random fields to learn an accurate gene prediction scoring function.
Ab initio methods have been benchmarked, with some approaching 100% sensitivity; however, as the sensitivity increases, accuracy suffers as a result of increased false positives.
