Bio Lec 4

This is related to bioinformatics

Uploaded by

Huma Tehreem

Introduction to Bioinformatics
Profiles and Progressive Alignment
By: Mirza A. Hammad
Profiles for families of sequences can be built from MSAs

MSA:
       1     2     3
       C     G     -
       A     A     T
       A     A     A
       -     A     -

Profile:
       1     2     3
    A  50%   75%   25%
    C  25%    0%    0%
    T   0%    0%   25%
    G   0%   25%    0%
    -  25%    0%   50%

Note: While profiles can be used for any kind of sequence data, we’ll focus on protein sequences.
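As a minimal sketch of how the profile above follows from the MSA, the snippet below counts symbols column by column and divides by the number of sequences. The function name `build_profile` and the use of '-' as the gap symbol are assumptions made for this illustration.

```python
from collections import Counter

def build_profile(msa):
    """Return one dict per alignment column mapping each symbol to its frequency."""
    if not msa or len({len(row) for row in msa}) != 1:
        raise ValueError("MSA rows must be non-empty and of equal length")
    n_rows = len(msa)
    profile = []
    for column in zip(*msa):  # iterate over columns, not rows
        counts = Counter(column)
        profile.append({sym: n / n_rows for sym, n in counts.items()})
    return profile

# The four-sequence, three-column MSA from the slide ('-' marks a gap).
msa = ["CG-", "AAT", "AAA", "-A-"]
profile = build_profile(msa)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Symbols that never occur in a column are simply absent from that column's dict, which corresponds to the 0% entries in the table.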
Profiles

• Profile: a table that lists the frequency of each amino acid at each position of a protein sequence

• Frequencies are calculated from an MSA containing a domain of interest

• Allows us to identify a consensus sequence

• A derived scoring scheme allows us to align a new sequence to the profile

• Profiles can be used in database searches
  • Find new sequences that match the profile

• Profiles are also used to compute multiple alignments heuristically
  • Progressive alignment
Profiles: Position-Specific Scoring Matrix (PSSM)
• To compare a sequence to a profile, we need to assign a score to each amino acid

• The profile score for amino acid a at position p is

$$M(p,a) = \sum_{b=1}^{20} f(p,b)\, s(a,b)$$

where

• f(p,b) = frequency of amino acid b at position p

• s(a,b) = score of the pair (a,b) (from, e.g., BLOSUM or PAM)
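The formula can be illustrated with a small sketch. To keep it short, it assumes a toy three-letter alphabet and a hypothetical substitution table `TOY_SCORES`; these values are placeholders for illustration, not a BLOSUM or PAM matrix, and a real PSSM would sum over all 20 amino acids.

```python
ALPHABET = "KLM"  # toy alphabet; a real profile covers all 20 amino acids

# Hypothetical symmetric substitution scores, for illustration only.
TOY_SCORES = {
    ("K", "K"): 5, ("L", "L"): 4, ("M", "M"): 5,
    ("K", "L"): -2, ("K", "M"): -1, ("L", "M"): 2,
}

def s(a, b):
    """Look up the score of the unordered pair (a, b)."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]

def pssm_score(freqs, a):
    """M(p, a) = sum over b of f(p, b) * s(a, b), for one profile column."""
    return sum(f * s(a, b) for b, f in freqs.items())

# One profile column: f(p, K) = 0.75, f(p, M) = 0.25, all other frequencies 0.
column = {"K": 0.75, "M": 0.25}
pssm_column = {a: pssm_score(column, a) for a in ALPHABET}
print(pssm_column)
```

With these placeholder scores, K receives 0.75 * 5 + 0.25 * (-1) = 3.5, so a residue is rewarded for being similar to everything observed in the column, not only for matching the most frequent residue.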
Profiles: PSSM

[Figure: example profile (PSSM) including an insertion/deletion penalty. Source: Gribskov et al., PNAS 84(13): 4355 (1987)]
Profiles: Consensus Sequence

• A consensus residue C(p) is generated at each position of the profile to aid the display of alignments of target sequences with the profile

• The consensus residue c is the amino acid at p that has the highest score M(p,c)
  • c is the amino acid most mutationally similar to all the aligned residues of the probe sequences at p, rather than the most common one
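The distinction between the highest-scoring residue and the most common one can be seen in a small sketch. It assumes a toy alphabet of I, L and V, a hypothetical score table `TOY_SCORES`, and made-up column frequencies; none of these come from the lecture.

```python
ALPHABET = "ILV"  # toy alphabet; a real profile ranges over all 20 amino acids

# Hypothetical symmetric substitution scores, for illustration only.
TOY_SCORES = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
}

def s(a, b):
    """Look up the score of the unordered pair (a, b)."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]

def profile_score(freqs, a):
    """M(p, a): frequency-weighted similarity of a to the residues in the column."""
    return sum(f * s(a, b) for b, f in freqs.items())

def consensus_residue(freqs):
    """C(p): the residue c that maximises M(p, c)."""
    return max(ALPHABET, key=lambda a: profile_score(freqs, a))

# Made-up column: L is the most frequent residue.
column = {"L": 0.4, "I": 0.3, "V": 0.3}
most_common = max(column, key=column.get)
consensus = consensus_residue(column)
print(most_common, consensus)
```

Here L is the most common residue, but I is chosen as the consensus because it scores well against both L and V, giving it the highest M(p,c) under these placeholder scores.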
Aligning a sequence to a profile
MSA:
    K L M - K
    K L K L K
    K M M L -
    M L - L M

Profile:
       1    2    3    4    5
    K  .75       .25       .50
    L       .75       .75
    M  .25  .25  .50       .25
    -            .25  .25  .25

New sequence:
    K K L L M

Align with profile:
    K K L - L M
    1 - 2 3 4 5

Resulting MSA:
    K K L - L M
    K - L M - K
    K - L K L K
    K - M M L -
    M - L - L M
Scoring a sequence-to-profile alignment
• Score each column separately according to the PSSM
• Each character contributes to the score, weighted by its frequency

Profile:
       1    2    3    4    5
    K  .75       .25       .50
    L       .75       .75
    M  .25  .25  .50       .25
    -            .25  .25  .25

Alignment:
    K K L - L M
    1 - 2 3 4 5

Column 1 score: 0.75 s(K,K) + 0.25 s(K,M)
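The column-by-column scoring can be sketched as follows for the alignment of K K L - L M against profile columns 1 - 2 3 4 5. The substitution values in `TOY_SCORES`, the flat `GAP_SCORE` for a residue opposite a gap, and the score of 0 for gap against gap are all assumptions made for illustration; the lecture does not specify them.

```python
GAP = "-"
GAP_SCORE = -4  # hypothetical score for a residue opposite a gap

# Hypothetical symmetric substitution scores, for illustration only.
TOY_SCORES = {
    ("K", "K"): 5, ("L", "L"): 4, ("M", "M"): 5,
    ("K", "L"): -2, ("K", "M"): -1, ("L", "M"): 2,
}

def s(a, b):
    if a == GAP and b == GAP:
        return 0  # assumption: gap against gap is neutral
    if a == GAP or b == GAP:
        return GAP_SCORE
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]

def column_score(symbol, freqs):
    """Score one aligned symbol against one profile column, weighted by frequency."""
    return sum(f * s(symbol, b) for b, f in freqs.items())

# The five-column profile from the slide.
PROFILE = [
    {"K": 0.75, "M": 0.25},
    {"L": 0.75, "M": 0.25},
    {"K": 0.25, "M": 0.50, GAP: 0.25},
    {"L": 0.75, GAP: 0.25},
    {"K": 0.50, "M": 0.25, GAP: 0.25},
]

aligned_seq = "KKL-LM"
aligned_cols = [0, None, 1, 2, 3, 4]  # None marks a newly inserted all-gap column

scores = []
for symbol, col in zip(aligned_seq, aligned_cols):
    freqs = {GAP: 1.0} if col is None else PROFILE[col]
    scores.append(column_score(symbol, freqs))
total = sum(scores)
print(scores, total)
```

The first entry is the column 1 score, 0.75 s(K,K) + 0.25 s(K,M), which evaluates to 3.5 under these placeholder values. The total depends entirely on the assumed scores and is only meant to show the mechanics.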
Profile-to-sequence alignments

• The optimum alignment can be found by dynamic programming
  • Extension of Needleman-Wunsch

• Spaces are only added to the MSA, never removed
  • Once a gap, always a gap

• Can also align profiles to profiles
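A minimal sketch of the dynamic-programming idea, assuming a linear gap penalty and a hypothetical score table (`TOY_SCORES`, `GAP_SCORE`); it is a simplified Needleman-Wunsch variant for illustration, not the algorithm of any particular program. A residue that is not matched to a profile column is treated as opening a new all-gap column in the MSA, so existing gaps are never removed.

```python
GAP = "-"
GAP_SCORE = -4  # hypothetical linear gap penalty
TOY_SCORES = {  # hypothetical substitution scores, for illustration only
    ("K", "K"): 5, ("L", "L"): 4, ("M", "M"): 5,
    ("K", "L"): -2, ("K", "M"): -1, ("L", "M"): 2,
}

def s(a, b):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return GAP_SCORE
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]

def column_score(symbol, freqs):
    return sum(f * s(symbol, b) for b, f in freqs.items())

def align_to_profile(seq, profile):
    """Global alignment of seq to profile; returns (score, gapped seq, column map)."""
    n, m = len(seq), len(profile)
    skip = [column_score(GAP, col) for col in profile]  # gap in seq opposite column j
    F = [[0.0] * (m + 1) for _ in range(n + 1)]         # F[i][j]: seq[:i] vs profile[:j]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + GAP_SCORE               # residue opens a new gap column
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + skip[j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(seq[i - 1], profile[j - 1]),
                F[i - 1][j] + GAP_SCORE,
                F[i][j - 1] + skip[j - 1],
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, aligned, cols = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(seq[i - 1], profile[j - 1]):
            aligned.append(seq[i - 1]); cols.append(j); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + GAP_SCORE:
            aligned.append(seq[i - 1]); cols.append(None); i -= 1  # new MSA column
        else:
            aligned.append(GAP); cols.append(j); j -= 1
    return F[n][m], "".join(reversed(aligned)), cols[::-1]

PROFILE = [
    {"K": 0.75, "M": 0.25},
    {"L": 0.75, "M": 0.25},
    {"K": 0.25, "M": 0.50, GAP: 0.25},
    {"L": 0.75, GAP: 0.25},
    {"K": 0.50, "M": 0.25, GAP: 0.25},
]

# "KLLM" is a hypothetical sequence, shorter than the profile so a gap is forced.
score, aligned, cols = align_to_profile("KLLM", PROFILE)
print(score, aligned, cols)
```

Which alignment is optimal depends on the scoring parameters. Under these placeholder values the slide's sequence K K L L M aligns to the five columns without gaps, whereas a lower gap penalty would favour a gapped placement. Aligning a profile to a profile follows the same recurrence, with both sides scored as frequency-weighted columns.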


Evolutionary Profiles

• The profiles just seen are called average profiles

• They generally perform well, but disregard some of the biology
  • How did each position evolve?
  • Amount of conservation varies from position to position
  • Type of conservation varies from position to position

• Alternative: evolutionary profiles
  • Gribskov, M. and Veretnik, S., Methods in Enzymology 266, 198-212, 1996
Progressive multiple alignment
• Feng & Doolittle 1987, Higgins and Sharp 1988

• Idea: the sequences to be aligned are phylogenetically related
  • These relationships are used to guide the alignment

• Popular implementations: CLUSTALW, PILEUP, T-Coffee


CLUSTALW

1. Perform pairwise alignments between all pairs of sequences (n × (n-1)/2 pairs)

2. Generate a distance matrix
   • Distance between a pair = number of mismatched positions in the alignment divided by the total number of aligned (non-gap) positions

3. Generate a Neighbor-Joining ‘guide tree’ from the distance matrix

4. Use the guide tree to progressively align sequences in pairs from the tips to the root of the tree
   • Actually, align profiles
   • “Once a gap, always a gap”
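Steps 1 and 2 can be sketched as follows. The snippet assumes the pairwise alignments already exist and, as a stand-in, reuses the rows of one small alignment; it also reads the distance as mismatches divided by the positions where neither sequence has a gap, which is an interpretation of the definition above. The guide tree construction (step 3) is not covered here.

```python
from itertools import combinations

def pair_distance(a, b, gap="-"):
    """Fraction of mismatches among positions where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        raise ValueError("no aligned residue pairs to compare")
    mismatches = sum(x != y for x, y in compared)
    return mismatches / len(compared)

# Stand-in for the pairwise alignments of step 1: rows of one small alignment.
aligned = ["KLM-K", "KLKLK", "KMML-", "ML-LM"]

n = len(aligned)
pairs = list(combinations(range(n), 2))  # n * (n - 1) / 2 pairs
distances = {(i, j): pair_distance(aligned[i], aligned[j]) for i, j in pairs}

for (i, j), d in distances.items():
    print(f"seq{i + 1} vs seq{j + 1}: {d:.2f}")
```

For four sequences this yields six distances, which form the upper triangle of the distance matrix handed to the tree-building step.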
CLUSTALW
CLUSTALW Tree

Tree calculated from an alignment of more than 1100 ring finger domains, using ClustalW 1.83.
CLUSTALW heuristics
1. Individual weights are assigned to each sequence in a partial alignment in order to down-weight similar sequences and up-weight highly divergent ones

2. Substitution matrices are varied at different alignment stages according to sequence divergence

3. Gaps
   • Positions in early alignments where gaps have been opened receive locally reduced gap penalties
   • Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than in regular secondary structure
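To illustrate the first heuristic, here is one simple way sequence weights could enter a profile: each sequence contributes to the column frequencies in proportion to its weight. The weights below are made up for the example, and the function `weighted_profile` is hypothetical. CLUSTALW derives its weights from the branch lengths of the guide tree, which this sketch does not reproduce.

```python
from collections import defaultdict

def weighted_profile(msa, weights):
    """Column frequencies where each sequence counts in proportion to its weight."""
    if len(msa) != len(weights):
        raise ValueError("need exactly one weight per sequence")
    total = sum(weights)
    profile = []
    for column in zip(*msa):
        freqs = defaultdict(float)
        for symbol, w in zip(column, weights):
            freqs[symbol] += w / total
        profile.append(dict(freqs))
    return profile

msa = ["KLM-K", "KLKLK", "KMML-", "ML-LM"]
# Hypothetical weights: the two similar sequences are down-weighted,
# the most divergent one (the last) is up-weighted.
weights = [0.5, 0.5, 1.0, 2.0]

weighted = weighted_profile(msa, weights)
unweighted = weighted_profile(msa, [1, 1, 1, 1])
print(unweighted[0], weighted[0])
```

In column 1 the unweighted frequency of K is 0.75, but with these weights it drops to 0.5, so the three K-containing sequences no longer dominate the column.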
Progressive Alignment: Discussion
• Strengths:
  • Speed
  • Progression is biologically sensible (aligns using a tree)

• Weaknesses:
  • No objective function
  • No way of quantifying whether or not the alignment is good
Problems with CLUSTALW

• Local minimum problem:
  • Alignment depends on the order in which sequences are added

• With each alignment, some proportion of residues is misaligned
  • Worse for divergent sequences

• Errors get “locked in” and propagate as sequences are added

• Can result in arbitrary and incorrect alignments

• Clustal uses global alignment, which may not be accurate for all parts of the sequence
  • T-Coffee considers local similarity as well as global
Iterative alignment

• To avoid local minima, realign subgroups of sequences and then incorporate them into a growing multiple sequence alignment

• Improves the overall alignment score
  • May involve rebuilding the guide tree
  • May be randomized

• Programs:
  • MultAlin
  • PRRP
  • DIALIGN
