Alignment of Sequences

This document discusses pairwise sequence alignment and scoring matrices. It begins by introducing pairwise alignment and its importance for revealing functional, structural, and evolutionary information. It then covers local versus global alignment and how dynamic programming is used for pairwise sequence alignment. The document also discusses how scoring matrices are constructed to generalize sequence alignment scoring based on biological evidence of amino acid substitutions. Common scoring matrices like PAM and BLOSUM are introduced. Finally, it discusses the differences between local and global alignment and why local alignment is often preferred.

Uploaded by

Raj Kumar Soni

I519 Introduction to Bioinformatics, Fall 2012

Pairwise alignment of DNA/protein sequences
We compare biological molecules, not any two strings!
▪ Sequence alignment reveals function, structure, and evolutionary information!
▪ From edit distance to a distance between two biological molecules with biological meaning: the scoring matrix
▪ Local alignment versus global alignment
▪ The dynamic programming algorithm is still the algorithm behind pairwise alignment of biological sequences
Aligning DNA Sequences: scoring matrix
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: alignment of V and W with columns labeled match, mismatch, and indels (insertion, deletion); 4 matches, 1 mismatch, 2 insertions, 2 deletions]
Simple scoring
▪ When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is:

#matches – μ(#mismatches) – σ(#indels)
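The formula above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `simple_score`, the default values of `mu` and `sigma`, and the toy alignment are all assumptions chosen for the example.

```python
def simple_score(row_v, row_w, mu=1.0, sigma=2.0):
    """Score two gapped rows of equal length: +1 per match, -mu per mismatch, -sigma per indel."""
    if len(row_v) != len(row_w):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for x, y in zip(row_v, row_w):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gap characters")
        if x == "-" or y == "-":
            indels += 1          # insertion or deletion
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels

# Hypothetical toy alignment (not the one shown on the slide):
# 3 matches, 1 mismatch, 1 indel -> 3 - 1*1 - 2*1 = 0
print(simple_score("AT-GC", "ATTGA", mu=1, sigma=2))
```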

Scoring matrices
To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ.
In the case of an amino acid sequence alignment, the scoring matrix would have size (20+1) x (20+1). The addition of 1 is to include the score for comparison with a gap character "-".
This simplifies the algorithm to the following recurrence:
s_{i,j} = \max \begin{cases}
  s_{i-1,j-1} + \delta(v_i, w_j) \\
  s_{i-1,j} + \delta(v_i, -) \\
  s_{i,j-1} + \delta(-, w_j)
\end{cases}
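A minimal Python sketch of this recurrence follows. It only computes the score of the best global alignment (no traceback). The helper `make_delta` and its match, mismatch, and gap values are assumptions for illustration; δ is stored as a dictionary keyed by character pairs, with "-" as the gap symbol.

```python
def make_delta(alphabet, match=1, mismatch=-1, gap=-2):
    """Build a scoring 'matrix' as a dict over (a, b) pairs, including the gap symbol '-'."""
    delta = {}
    for a in alphabet:
        delta[(a, "-")] = gap
        delta[("-", a)] = gap
        for b in alphabet:
            delta[(a, b)] = match if a == b else mismatch
    return delta

def global_alignment_score(v, w, delta):
    """Fill the table s with the recurrence above and return s[n][m]."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):               # first column: v aligned against gaps only
        s[i][0] = s[i - 1][0] + delta[(v[i - 1], "-")]
    for j in range(1, m + 1):               # first row: w aligned against gaps only
        s[0][j] = s[0][j - 1] + delta[("-", w[j - 1])]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta[(v[i - 1], w[j - 1])],  # match or mismatch
                s[i - 1][j] + delta[(v[i - 1], "-")],           # v_i against a gap
                s[i][j - 1] + delta[("-", w[j - 1])],           # w_j against a gap
            )
    return s[n][m]

delta = make_delta("ACGT")
print(global_alignment_score("ATCTGATG", "TGCATAC", delta))
```

Keeping δ as a lookup table means the same function works unchanged for DNA or protein alphabets; only the table changes.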
Scoring matrices for DNA sequence alignment
▪ A simple positive score for matches and a negative score for mismatches and gaps are most often used.
▪ Transversions are penalized more than transitions
– transitions: replacement of a purine base with another purine, or of a pyrimidine with another pyrimidine (A <-> G, C <-> T)
– transversions: replacement of a purine with a pyrimidine or vice versa
– Transition mutations are more common than transversions
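One way to encode the transition/transversion distinction in a scoring matrix is sketched below. The function name `dna_delta` and the specific score values are assumptions; the only property taken from the slide is that transversions are penalized more than transitions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_delta(match=1, transition=-1, transversion=-2, gap=-2):
    """Return a dict-based DNA scoring matrix that separates transitions from transversions."""
    delta = {}
    for a in "ACGT":
        delta[(a, "-")] = gap
        delta[("-", a)] = gap
        for b in "ACGT":
            if a == b:
                delta[(a, b)] = match
            elif (a in PURINES) == (b in PURINES):
                # both purines or both pyrimidines: A <-> G or C <-> T
                delta[(a, b)] = transition
            else:
                # purine <-> pyrimidine
                delta[(a, b)] = transversion
    return delta

delta = dna_delta()
print(delta[("A", "G")], delta[("A", "C")])  # transition score, transversion score
```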
Making a scoring matrix for protein sequence alignment
▪ Scoring matrices are created based on biological evidence.
▪ Alignments can be thought of as two sequences that differ due to mutations.
▪ Some of these mutations have little effect on the protein's function, so some penalties, δ(v_i, w_j), will be less harsh than others.
▪ We need to know how often one amino acid is substituted for another in related proteins.
Scoring matrix: example
      A   R   N   K
A     5  -2  -1  -1
R     -   7  -1   3
N     -   -   7   0
K     -   -   -   6
• Notice that although R and K are different amino acids, they have a positive score.
• Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.
Conservation
▪ Amino acid changes that tend to preserve the physico-chemical properties of the original residue
– Polar to polar
• aspartate → glutamate
– Nonpolar to nonpolar
• alanine → valine
– Similarly behaving residues
• leucine to isoleucine
Common scoring matrices for protein sequence alignment
▪ Amino acid substitution matrices
– PAM
– BLOSUM
▪ Try to compare protein-coding regions at the amino acid level
– DNA is less conserved than protein sequences (codon degeneracy; synonymous mutations)
– It is less effective to compare coding regions at the nucleotide level
Reading: Chapter 4, 4.3
PAM
▪ Point Accepted Mutation (Dayhoff et al.)
▪ 1 PAM = PAM1 = 1% average change of all amino acid positions
– After 100 PAMs of evolution, not every residue will have changed
• some residues may have mutated several times
• some residues may have returned to their original state
• some residues may not have changed at all
PAM1 & PAM250
Substitution   PAM1     PAM250
Phe to Ala     0.0002   0.04
Phe to Arg     0.0001   0.01
Phe to Asp
..
Phe to Phe     0.9946   0.32
…
Sum            1        1

Normalized probability scores for changing Phe to other amino acids at PAM1 and PAM250 evolutionary distances (Chapter 3, table 3.2)

Log-odds substitution matrices
▪ These matrices use amino acid changes that were observed in closely related proteins; such changes represent amino acid substitutions that don't significantly change the structure and function of the protein.
– They are "accepted mutations", that is, "accepted" by natural selection.
▪ Each score is the log-odds of the probability of matching a pair of amino acids in this database relative to a random one
– Ref: Amino acid substitution matrices from protein blocks (PNAS. 1992, 89(22): 10915–10919)
Log-odds scores
Define f_{ij} as the frequency of observing amino acid pair i, j. Then the observed probability of occurrence for each i, j pair is
q_{ij} = f_{ij} / \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}
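A minimal sketch of turning pair counts into log-odds scores, assuming the construction used in the cited PNAS paper: observed pair probabilities are normalized counts, background residue probabilities are derived from them, and each score is log2(observed / expected). The function name `log_odds`, the two-letter alphabet, and the counts are hypothetical; the scaling and rounding used in published matrices are omitted.

```python
import math

def log_odds(pair_counts):
    """pair_counts maps an unordered pair (a, b), stored once with a <= b, to its count f_ab."""
    total = sum(pair_counts.values())
    q = {pair: f / total for pair, f in pair_counts.items()}   # observed pair probabilities

    # Background probability of each residue: each mixed pair contributes half to each member.
    p = {}
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + q_ab
        else:
            p[a] = p.get(a, 0.0) + q_ab / 2
            p[b] = p.get(b, 0.0) + q_ab / 2

    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(q_ab / expected)   # assumes q_ab > 0
    return scores

# Hypothetical counts: identical pairs are over-represented, the mixed pair is rare.
counts = {("A", "A"): 45, ("A", "B"): 10, ("B", "B"): 45}
scores = log_odds(counts)
print(scores)
```

With these toy counts the identical pairs get positive scores and the mixed pair gets a negative score, which is the behaviour a substitution matrix is meant to capture.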
The amino acid substitution is considered as a Markov model
▪ A Markov model is characterized by a series of changes of state in a system such that a change from one state to another does not depend on the previous history of the state
▪ Use of the Markov model makes it possible to extrapolate amino acid substitutions observed over a relatively short period of evolutionary time to longer periods of evolutionary time
– PAMx = (PAM1)^x
– The multiplication of two PAM1 matrices -> PAM2
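The extrapolation by repeated multiplication can be illustrated with a toy example. The 3-state matrix below is hypothetical (real PAM matrices are 20 x 20 and come from observed data); each row holds the probabilities of one residue changing to each residue and sums to 1, with a 99% chance of no change per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(m, x):
    """Return m multiplied by itself x times (identity for x = 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, m)
    return result

# Hypothetical "PAM1-like" transition matrix for a 3-letter alphabet.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam2 = mat_pow(toy_pam1, 2)      # the product of two PAM1 matrices
toy_pam250 = mat_pow(toy_pam1, 250)  # extrapolation to a longer evolutionary time

for row in toy_pam250:
    print([round(x, 3) for x in row], "row sum:", round(sum(row), 6))
```

The rows still sum to 1 after 250 steps, while the diagonal entries (no net change) drop well below 0.99, mirroring the Phe to Phe column in the PAM1 & PAM250 table.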
PAM250
▪ PAM250 is a widely used scoring matrix:
▪ PAM250 = (PAM1)^250

      Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
        A   R   N   D   C   Q   E   G   H   I   L   K ...
Ala A  13   6   9   9   5   8   9  12   6   8   6   7 ...
Arg R   3  17   4   3   2   5   3   2   6   3   2   9
Asn N   4   4   6   7   2   5   6   4   6   3   2   5
Asp D   5   4   8  11   1   7  10   5   6   3   2   5
Cys C   2   1   1   1  52   1   1   2   2   2   1   1
Gln Q   3   5   5   6   1  10   7   3   7   2   3   5
...
Trp W   0   2   0   0   0   0   0   0   1   0   1   0
Tyr Y   1   1   2   1   3   1   1   1   3   2   2   1
Val V   7   4   4   4   4   4   4   4   5   4  15  10
BLOSUM
▪ Blocks Substitution Matrix
▪ Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
▪ Matrix name indicates evolutionary distance
– BLOSUM62 was created using sequences sharing no more than 62% identity
BLOSUM versus PAM
▪ The PAM family
– PAM matrices are based on global alignments of closely related proteins.
– PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence; other PAM matrices are extrapolated from PAM1.
▪ The BLOSUM family
– BLOSUM matrices are based on local alignments (blocks).
– All BLOSUM matrices are based on observed alignments (BLOSUM62 is a matrix calculated from comparisons of sequences with no more than 62% identity).
▪ Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance; BLOSUM is the opposite.
– For alignment of distant proteins, you use PAM150 instead of PAM100, or BLOSUM50 instead of BLOSUM62.
Local vs. global alignment

• Global Alignment
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C

• Local Alignment: better alignment to find conserved segment
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Local vs. global alignment
▪ The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit/alignment graph.
– The Needleman–Wunsch algorithm
▪ The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i', j') in the edit graph.
– The Smith–Waterman algorithm
Local alignments: why?
▪ Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
▪ Example:
– Homeobox genes have a short region called the homeodomain that is highly conserved between species.
– A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.
The local alignment problem
▪ Goal: Find the best local alignment between two strings
▪ Input: Strings v, w and scoring matrix δ
▪ Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings
Local alignment: example
[Figure: edit graph showing a global alignment path and a local alignment path. Caption: Compute a "mini" Global Alignment to get Local]
Local alignment: running time
• Long run time O(n^6):
- In the grid of size n x n there are ~n^2 vertices (i,j) that may serve as a source and ~n^2 vertices (i',j') that may serve as a sink.
- For each such pair of vertices, computing the alignment from (i,j) to (i',j') takes O(n^2) time.
We do NOT go with this algorithm!
The local alignment recurrence
• The largest value of s_{i,j} over the whole edit graph is the score of the best local alignment.
• The recurrence:
s_{i,j} = \max \begin{cases}
  0 \\
  s_{i-1,j-1} + \delta(v_i, w_j) \\
  s_{i-1,j} + \delta(v_i, -) \\
  s_{i,j-1} + \delta(-, w_j)
\end{cases}
Power of ZERO: the 0 is the only change from the original recurrence of a Global Alignment, since there is only one "free ride" edge entering into every vertex.
• Complexity: O(N^2), or O(MN)
• Initialization will be different
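A minimal sketch of the local alignment recurrence in Python, returning only the best score (no traceback). The helper `make_delta` and its score values are assumptions for illustration; the two sequences are the ones from the local vs. global alignment example, converted to upper case.

```python
def make_delta(alphabet, match=1, mismatch=-1, gap=-2):
    """Dict-based scoring matrix over (a, b) pairs, with '-' as the gap symbol."""
    delta = {}
    for a in alphabet:
        delta[(a, "-")] = gap
        delta[("-", a)] = gap
        for b in alphabet:
            delta[(a, b)] = match if a == b else mismatch
    return delta

def local_alignment_score(v, w, delta):
    """Return the largest s[i][j] over the whole table."""
    n, m = len(v), len(w)
    # Initialization differs from the global case: the first row and column stay 0.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,                                              # the "free ride" edge
                s[i - 1][j - 1] + delta[(v[i - 1], w[j - 1])],
                s[i - 1][j] + delta[(v[i - 1], "-")],
                s[i][j - 1] + delta[("-", w[j - 1])],
            )
            best = max(best, s[i][j])
    return best

delta = make_delta("ACGT")
v = "tccCAGTTATGTCAGgggacacgagcatgcagagac".upper()
w = "aattgccgccgtcgttttcagCAGTTATGTCAGatc".upper()
print(local_alignment_score(v, w, delta))  # at least 12: the shared segment CAGTTATGTCAG
```

The two nested loops visit each of the (n+1)(m+1) cells once, which is where the O(MN) complexity comes from.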
Scoring indels: naive approach
▪ A fixed penalty σ is given to every indel:
– -σ for 1 indel,
– -2σ for 2 consecutive indels
– -3σ for 3 consecutive indels, etc.
▪ This can be too severe a penalty for a series of 100 consecutive indels
Arbitrary gap penalty?
There are many such edges!
Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the sequence length, so the grid has ~n^2 vertices).
So the complexity increases from O(n^2) to O(n^3).
Affine gap penalties
▪ In nature, a series of k indels often comes as a single event rather than a series of k single nucleotide events:
[Figure: two alignments with the same number of indels. One contiguous gap: "This is more likely." Scattered single gaps: "This is less likely." Normal scoring would give the same score for both alignments.]
Affine gap penalties
▪ Score for a gap of length x is: -(ρ + σx)
where ρ > 0 is the penalty for introducing a gap (gap opening penalty) and σ is the penalty for each position in the gap (gap extension penalty). ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap.
▪ Reduced penalties (as compared to naive scoring) are given to runs of horizontal and vertical edges
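The difference between the naive and affine schemes shows up in a few lines of arithmetic. The function names and the values ρ = 10, σ = 1 below are assumptions chosen for illustration.

```python
def linear_gap_score(x, sigma):
    """Naive scheme: every indel in a run of length x costs sigma."""
    return -sigma * x

def affine_gap_score(x, rho, sigma):
    """Affine scheme: -(rho + sigma * x) for a gap of length x > 0, and 0 for no gap."""
    return -(rho + sigma * x) if x > 0 else 0

rho, sigma = 10, 1  # hypothetical opening and extension penalties

one_long = affine_gap_score(3, rho, sigma)          # -(10 + 3) = -13
three_short = 3 * affine_gap_score(1, rho, sigma)   # 3 * -(10 + 1) = -33
print(one_long, three_short)

# Under the naive scheme both arrangements score the same:
print(linear_gap_score(3, sigma), 3 * linear_gap_score(1, sigma))
```

With affine penalties a single gap of length 3 is penalized far less than three separate gaps of length 1, matching the intuition that one long indel event is more likely.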
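Affine gap penalty recurrences
Labels on the cases of the three recurrences:
– Deletion recurrence: continue gap in y (deletion); start gap in y (deletion)
– Insertion recurrence: continue gap in x (insertion); start gap in x (insertion)
– Main recurrence: match or mismatch; end with deletion; end with insertion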
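A common three-table formulation that fits these labels is sketched below in Python. It is an assumption about the form of the recurrences, not a transcription of the slide: `down` holds the best score of alignments ending with a deletion, `right` the best score ending with an insertion, and `s` the overall best. The match, mismatch, ρ, and σ values are hypothetical, and only the score is returned.

```python
NEG_INF = float("-inf")

def affine_global_score(v, w, match=1, mismatch=-1, rho=3, sigma=1):
    """Global alignment score where a gap of length x costs rho + sigma * x."""
    n, m = len(v), len(w)
    s = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    down = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # alignment ends with a deletion
    right = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # alignment ends with an insertion
    s[0][0] = 0
    for i in range(1, n + 1):            # first column: one gap of length i
        down[i][0] = -(rho + sigma * i)
        s[i][0] = down[i][0]
    for j in range(1, m + 1):            # first row: one gap of length j
        right[0][j] = -(rho + sigma * j)
        s[0][j] = right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            down[i][j] = max(
                down[i - 1][j] - sigma,           # continue gap (deletion)
                s[i - 1][j] - (rho + sigma),      # start gap (deletion)
            )
            right[i][j] = max(
                right[i][j - 1] - sigma,          # continue gap (insertion)
                s[i][j - 1] - (rho + sigma),      # start gap (insertion)
            )
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(diag, down[i][j], right[i][j])  # match/mismatch, end with deletion, end with insertion
    return s[n][m]

# Six matches and one gap of length 3: 6 - (3 + 3*1) = 0
print(affine_global_score("AAATTTCCC", "AAACCC"))
```

Keeping separate tables for "currently inside a gap" lets the opening penalty be charged once per gap while the running time stays O(nm).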
Readings
▪ Chapter 4, 4.1
– "Alignment can reveal homology between sequences" (similarity vs homology)
– "It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences"
▪ Primer article: "What is dynamic programming?" by Eddy
We will continue on
▪ Significance of an alignment (score)
– Homologous or not?
▪ Faster alignment tools for database search
