Sequence Analysis
1 DNA SEQUENCING & SEQUENCE ALIGNMENT
Sequence Alignment
Study Goal
- What a sequence alignment is
- The difference between a global and a local alignment, and what each is used for
- How to use dot matrix methods to analyze genes and chromosomes
- The steps performed by the Needleman-Wunsch and Smith-Waterman algorithms to produce a sequence alignment
- How to use scoring matrix values and gap penalties to produce a sequence alignment
Sequence Alignment
Definition: the comparison of two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
Sequence alignment score: the sum of the individual log odds scores for each pair of aligned sequence characters in an alignment, less a penalty for each gap of one or more positions.
Pair-wise Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings
$x = x_1 x_2 \ldots x_M$,
$y = y_1 y_2 \ldots y_N$,
an alignment is an assignment of gaps to positions $0, \ldots, M$ in x and $0, \ldots, N$ in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
What is a good alignment?
Two sequences: AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps
Evolution at the DNA level
SEQUENCE EDITS
- Deletion
- Mutation

ACGGTGCAGTTACCA
AC----CAGTCCACCA

REARRANGEMENTS
- Inversion
- Translocation
- Duplication
Evolutionary
[Figure: sequences passed on to the next generation, labeled OK, OK, OK, X, X, with the question "Still OK?"]
Scoring Function
Sequence edits: AGGCCTC
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC

Alternative definition: minimal edit distance. Given two strings x, y, find the minimum number of edits (insertions, deletions, mutations) needed to transform one string into the other.

Scoring function:
- Match: +m
- Mismatch: -s
- Gap: -d

Score $F = (\#\,\text{matches}) \times m - (\#\,\text{mismatches}) \times s - (\#\,\text{gaps}) \times d$
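The scoring function can be sketched in a few lines of Python. This is a minimal illustration, assuming that each gap character counts as one gap and that m = s = d = 1; the function name `alignment_score` and these default weights are illustrative choices, not values given on the slide.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=1):
    """Score two gapped rows: F = matches*m - mismatches*s - gaps*d."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # one penalty per gap position (assumption)
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# The three candidate alignments of AGGCTAGTT and AGCGAAGTTT
candidates = [
    ("AGGCTAGTT-", "AGCGAAGTTT"),      # 6 matches, 3 mismatches, 1 gap
    ("AGGCTA-GTT-", "AG-CGAAGTTT"),    # 7 matches, 1 mismatch, 3 gaps
    ("AGGC-TA-GTT-", "AG-CG-AAGTTT"),  # 7 matches, 0 mismatches, 5 gaps
]
for row_x, row_y in candidates:
    print(row_x, row_y, alignment_score(row_x, row_y))
```

Which alignment is "best" depends on the weights: raising the gap penalty d favors the first alignment, while raising the mismatch penalty s favors the third.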
Pair-wise Alignment
- The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
- There are lots of possible alignments.
- Two sequences can always be aligned.
- Sequence alignments have to be scored.
- Often there is more than one solution with the same score.
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments: $\gg 2^N$
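To get a feel for how quickly the number of alignments grows, the sketch below counts them for two sequences of lengths m and n. It assumes that every distinct sequence of columns (letter/letter, letter/gap, gap/letter) counts as a separate alignment; under that assumption the last column has three possible forms, which gives the recursion used here. The function name `count_alignments` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of gapped alignments of sequences of length m and n."""
    if m == 0 or n == 0:
        return 1  # only one way: align the remaining letters to gaps
    return (
        count_alignments(m - 1, n - 1)  # last column: letter / letter
        + count_alignments(m - 1, n)    # last column: letter / gap
        + count_alignments(m, n - 1)    # last column: gap / letter
    )


for n in (1, 2, 3, 10, 20):
    print(n, count_alignments(n, n), 2 ** n)
```

Even for two sequences of length 10 the count is already in the millions, which is why enumerating alignments is not a practical way to find the best one.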
Local alignment vs. Global Alignment
Global alignment
- Used to align the entire length of both sequences
- Best for sequences of similar length

Local alignment
- Used to align the regions of the sequences that are similar
- Best for sequences sharing a conserved region or domain
What do we want alignment for?
To identify homologous genes, for example orthologous genes and xenologous (horizontally transferred) genes.
Matches to similar sequences
Sequence conservation, structure conservation, function conservation

What is conserved in a gene (protein) family is functionally important:
- Conservation is due to purifying selection driven by functional constraints, observable against a background described by the theory of neutral evolution.
- Neutral evolution is fast enough that pseudogenes rapidly deteriorate over evolutionary timescales.
- In any prokaryotic genome, homologs from more than one distantly related species are detectable for 70-80% of proteins.

Application: comparison of sequences and structures can identify homologous relationships, allowing inference of function based on that relationship.
Methods of Alignment
- By hand: slide sequences on two lines of a word processor
- Dot plot
  - with windows
- Rigorous mathematical approach
  - Dynamic programming (slow, optimal)
- Heuristic methods (fast, approximate)
  - BLAST and FASTA
  - Word matching and hash tables
Align by Hand
GATCGCCTA_TTACGTCCTGGAC
<--->
AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment.
Dotplot
A dotplot gives an overview of all possible alignments. In a dotplot, each diagonal corresponds to a possible (ungapped) alignment.

[Figure: dotplot with Sequence 1 and Sequence 2 on the two axes]

One possible alignment:
TACATTACGTAC
ATACACTTA
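A dotplot is simple to compute: place one sequence along each axis and mark every cell where the two characters are identical. The sketch below is a minimal text-mode version, assuming plain identity matching with no window; the function names `dotplot` and `render` are illustrative.

```python
def dotplot(seq1, seq2):
    """Boolean matrix: rows follow seq2, columns follow seq1."""
    return [[a == b for a in seq1] for b in seq2]


def render(seq1, seq2):
    """Draw the dotplot as text, '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq1)]
    for b, row in zip(seq2, dotplot(seq1, seq2)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TACATTACGTAC", "ATACACTTA"))
```

Runs of `*` along a diagonal in the printed matrix correspond to ungapped stretches where the two sequences agree.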
Insertions / Deletions in a Dotplot
[Figure: dotplot of Sequence 1 against Sequence 2]

TACTG-TCAT
||||| ||||
TACTGTTCAT
Dotplot
(Window = 130 / Stringency = 9)
[Figure: dotplot of one hemoglobin chain against another]
Word Size Algorithm
Word Size = 3

TACGGTATGACAGTATC

[Figure: the sequence above repeated with a word of size 3 at successive positions; the dotplot axes show CTATGACA and TACGGTATG]
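The word-size method can be sketched with a hash table: index every word of one sequence, then look up each word of the other. This is a minimal illustration, assuming exact word matches only; the function name `word_hits` is illustrative, and the two sequences are the fragments visible on the slide.

```python
from collections import defaultdict


def word_hits(seq1, seq2, word_size=3):
    """Return (i, j) pairs where seq1[i:i+w] equals seq2[j:j+w]."""
    index = defaultdict(list)            # word -> start positions in seq1
    for i in range(len(seq1) - word_size + 1):
        index[seq1[i:i + word_size]].append(i)
    hits = []
    for j in range(len(seq2) - word_size + 1):
        for i in index.get(seq2[j:j + word_size], []):
            hits.append((i, j))          # one dot per shared word
    return hits


print(word_hits("TACGGTATGACAGTATC", "CTATGACA"))
```

Hits that share the same value of i - j lie on one diagonal; here five consecutive hits on the diagonal i - j = 4 trace the shared stretch TATGACA.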
Window / Stringency
Scoring Matrix Filtering
Matrix: PAM250, Window = 12, Stringency = 9

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Scores with the window at three different positions: 11, 11, 7

A typical window size for DNA sequences is 15 bases with a stringency (match requirement) of 10: if there are 10 to 15 matches within the window, a dot is printed at the first base of the window.
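The window/stringency filter can be sketched as follows. This is a minimal version for DNA, assuming identity matching (each identical pair in the window counts as 1); for proteins the count would be replaced by a sum of scoring matrix values such as PAM250. The function name `window_dots` is illustrative.

```python
def window_dots(seq1, seq2, window=15, stringency=10):
    """Return (i, j) positions where a dot is printed.

    A dot is placed at the first base of a window when at least
    `stringency` of its `window` positions are identical.
    """
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if matches >= stringency:
                dots.append((i, j))
    return dots


# Small illustrative run with a shorter window than the typical 15/10
print(window_dots("ACGTACGTAC", "ACGTTCGTAC", window=5, stringency=4))
```

Unlike the word-size method, a window tolerates a few mismatches, so diagonals are not broken by single substitutions.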
Dotplot
(Window = 18 / Stringency = 10)
[Figure: dotplot of one hemoglobin chain against another]
Considerations
- The window/stringency method is more sensitive than the word-size method (ambiguities are permitted).
- The smaller the window, the larger the weight of statistical (unspecific) matches.
- With large windows the sensitivity for short sequences is reduced.
- Insertions/deletions are not treated explicitly.
Dot Matrix
- Unless two sequences are known to be very much alike, the dot matrix method should be considered as a first choice for pair-wise sequence alignment.
- It readily reveals the presence of insertions/deletions and of direct and inverted repeats.
- DNA sequence dot matrix comparison: long windows and high stringencies (7/11, 11/15).
- Protein sequences: short windows and stringencies (1/1), except when looking for a short domain of partial similarity in otherwise dissimilar sequences (15/5).
Programs:
- DOTTER: http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
- DOTPLOT: http://helix.nih.gov/docs/gcg/dotplot.html
- PLALIGN in FASTA
- EMBOSS: http://emboss.sourceforge.net/download/
Alignment is additive
Observation: the score of aligning
$x_1 \ldots x_M$
$y_1 \ldots y_N$
is additive.

Say that $x_1 \ldots x_i$ aligns to $y_1 \ldots y_j$, and $x_{i+1} \ldots x_M$ aligns to $y_{j+1} \ldots y_N$.

The two scores add up: $F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])$
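Additivity can be checked numerically. The sketch below assumes a linear gap penalty (each gap position costs d) and weights m = s = d = 1; it cuts one gapped alignment at every column and confirms that the left and right scores always sum to the score of the whole. The function names are illustrative.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=1):
    """+m per match, -s per mismatch, -d per gap position."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total -= d
        elif a == b:
            total += m
        else:
            total -= s
    return total


def split_sums(row_x, row_y):
    """Left score plus right score for every possible cut column."""
    return [
        alignment_score(row_x[:c], row_y[:c])
        + alignment_score(row_x[c:], row_y[c:])
        for c in range(len(row_x) + 1)
    ]


row_x, row_y = "AGGCTA-GTT-", "AG-CGAAGTTT"
print(alignment_score(row_x, row_y))   # score of the whole alignment
print(set(split_sums(row_x, row_y)))   # a single value: every cut agrees
```

This property is what allows an optimal alignment to be assembled from optimal alignments of shorter prefixes.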
Basic principles of dynamic programming
- Creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking (evaluation of the optimal path)

It is applicable when a large search space can be structured into a succession of stages, such that the initial stage contains trivial solutions to sub-problems, each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage, and the final stage contains the overall solution. (Mount)
Creation of an alignment path matrix
Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.
- Construct a matrix F indexed by i and j (one index for each sequence).
- $F(i,j)$ is the score of the best alignment between the initial segment $x_1 \ldots x_i$ of x and the initial segment $y_1 \ldots y_j$ of y.
- Build $F(i,j)$ recursively, beginning with $F(0,0) = 0$.

If $F(i-1,j-1)$, $F(i-1,j)$ and $F(i,j-1)$ are known, we can calculate $F(i,j)$. There are three possibilities:
- $x_i$ and $y_j$ are aligned: $F(i,j) = F(i-1,j-1) + s(x_i, y_j)$
- $x_i$ is aligned to a gap: $F(i,j) = F(i-1,j) - d$
- $y_j$ is aligned to a gap: $F(i,j) = F(i,j-1) - d$

The best score up to (i,j) is the largest of the three options.
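A small worked example makes the recursion concrete. The sketch below fills F for two short sequences, assuming a simple substitution score s (+1 for a match, -1 for a mismatch) and gap penalty d = 1; these values and the function name `fill_matrix` are illustrative. The first row and column hold the scores of aligning a prefix entirely to gaps.

```python
def fill_matrix(x, y, match=1, mismatch=-1, d=1):
    """Return F, where F[i][j] is the best score for x[:i] versus y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                 # x_1..x_i aligned to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d                 # y_1..y_j aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,     # x_i aligned to y_j
                F[i - 1][j] - d,         # x_i aligned to a gap
                F[i][j - 1] - d,         # y_j aligned to a gap
            )
    return F


for row in fill_matrix("AGC", "AC"):
    print(row)
```

The bottom-right entry is the score of the best alignment of the full sequences; for AGC and AC it corresponds to AGC over A-C (two matches, one gap).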
Dynamic Programming
- There are only a polynomial number of subproblems: align $x_1 \ldots x_i$ to $y_1 \ldots y_j$.
- The original problem is one of the subproblems: align $x_1 \ldots x_M$ to $y_1 \ldots y_N$.
- Each subproblem is easily solved from smaller subproblems.

Then we can apply dynamic programming. Let $F(i,j)$ = optimal score of aligning $x_1 \ldots x_i$ to $y_1 \ldots y_j$.
Example of Dynamic programming algorithm
Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align the sequences in a global alignment
Types of Scores
Scoring Matrix / Substitution Matrix
Scoring Matrix: Example

     A   R   N   K
A    5
R   -2   7
N   -1  -1   7
K   -1   3   0   6

Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so replacing one with the other will not greatly change the function of the protein.
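In code, a substitution matrix is just a symmetric lookup table. The sketch below stores the lower triangle of the four-residue example and scores an ungapped pairing by summing the entries; the names `LOWER_TRIANGLE`, `s` and `ungapped_score`, and the two short test sequences, are illustrative.

```python
# Lower triangle of the example matrix (rows/columns A, R, N, K)
LOWER_TRIANGLE = {
    ("A", "A"): 5,
    ("R", "A"): -2, ("R", "R"): 7,
    ("N", "A"): -1, ("N", "R"): -1, ("N", "N"): 7,
    ("K", "A"): -1, ("K", "R"): 3, ("K", "N"): 0, ("K", "K"): 6,
}


def s(a, b):
    """Symmetric lookup: s(a, b) == s(b, a)."""
    if (a, b) in LOWER_TRIANGLE:
        return LOWER_TRIANGLE[(a, b)]
    return LOWER_TRIANGLE[(b, a)]


def ungapped_score(seq1, seq2):
    """Sum of substitution scores over aligned positions (no gaps)."""
    return sum(s(a, b) for a, b in zip(seq1, seq2))


print(s("R", "K"))                     # conservative substitution
print(ungapped_score("ARNK", "AKNR"))  # 5 + 3 + 7 + 3
```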
Conservation
Amino acid changes that tend to preserve the physicochemical properties of the original residue:
- Polar to polar: aspartate to glutamate
- Nonpolar to nonpolar: alanine to valine
- Similarly behaving residues: leucine to isoleucine
Scoring Examples
Dynamic Programming
Scoring systems
DNA Scoring Systems -very simple
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

    A  G  C  T
A   1  0  0  0
G   0  1  0  0
C   0  0  1  0
T   0  0  0  1

Match: 1
Mismatch: 0
Score = 5
Protein Scoring Systems
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM

Scoring matrix:

     C   S   T   P   A   G   N   D
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
.
.

T:G = -2
T:T = 5
Score = 48
Protein Scoring Systems
Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

[Figure: Venn diagram of amino acid properties, with overlapping groups labeled tiny, small, aliphatic, aromatic, hydrophobic, polar, charged and positive]
Protein Scoring Systems
Scoring matrices reflect:
- the number of mutations needed to convert one amino acid to another
- chemical similarity
- observed mutation frequencies
- the probability of occurrence of each amino acid

Widely used scoring matrices:
- PAM
- BLOSUM
PAM 250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6

Highlighted values: -8 (C:W) and 17 (W:W)
PAM
Dayhoff Matrix
BLOSUM Matrix
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Conserved blocks in alignments
TIPS on choosing a scoring matrix
- Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
- When comparing closely related proteins, use lower PAM or higher BLOSUM matrices; for distantly related proteins, use higher PAM or lower BLOSUM matrices.
- For database searching, the commonly used matrix is BLOSUM62.
Significance of alignment
Database Searching
Local vs. Global Alignment
Global Alignment
--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C
Local Alignment: a better alignment to find a conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Global vs. Local Alignment
Global Alignment: Needleman and Wunsch (1970)
Local Alignment: Smith and Waterman (1981a)
Global Alignment
Two closely related sequences:
needle (Needleman & Wunsch) creates an end-to-end alignment.
Two sequences sharing several regions of local similarity:

 1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
   ||||||||||||||  |    |       | ||||  ||     | |     | ||
 1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG  70
The Needleman-Wunsch Matrix
[Figure: alignment matrix with $x_1 \ldots x_M$ along one axis and $y_1 \ldots y_N$ along the other]

Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences.

An optimal alignment is composed of optimal subalignments.
The Needleman-Wunsch Algorithm
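1. Initialization.
   $F(0,0) = 0$, $F(0,j) = -j \times d$, $F(i,0) = -i \times d$

2. Main Iteration. Filling in partial alignments.
   a. For each $i = 1 \ldots M$
        For each $j = 1 \ldots N$

   $$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1,j) - d & \text{[case 2]} \\ F(i,j-1) - d & \text{[case 3]} \end{cases}$$

   $$\text{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if [case 1]} \\ \text{LEFT} & \text{if [case 2]} \\ \text{UP} & \text{if [case 3]} \end{cases}$$

3. Termination. $F(M,N)$ is the optimal score, and from $\text{Ptr}(M,N)$ we can trace back the optimal alignment.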
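The three steps can be sketched in Python as follows. This is a minimal version, assuming a simple substitution score s (+1 match, -1 mismatch by default) and a linear gap penalty d; the function name `needleman_wunsch` and the default parameter values are illustrative, and ties between cases are broken in favor of DIAG, which is one arbitrary choice among equally good alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment; returns (score, gapped x, gapped y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: prefixes aligned entirely to gaps
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # 2. Main iteration: fill in partial alignments
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cases = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligned to y_j
                (F[i - 1][j] - d, "LEFT"),      # case 2: x_i aligned to a gap
                (F[i][j - 1] - d, "UP"),        # case 3: y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(cases, key=lambda c: c[0])

    # 3. Termination: trace back from Ptr(M, N)
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


score, row_x, row_y = needleman_wunsch("AGGCTAGTT", "AGCGAAGTTT")
print(score)
print(row_x)
print(row_y)
```

Both the time and the memory used grow with the product M × N, since every cell of the matrix is filled once and kept for the traceback.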
Global Alignment (Needleman -Wunsch)
- The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle).
- Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
- Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
- Global methods are useful when you want to force two sequences to align over their entire length.
Local Alignment (Smith-Waterman)
Local alignment: identify the most similar sub-region shared between two sequences (Smith-Waterman).
The local alignment problem
Given two strings
$x = x_1 \ldots x_M$, $y = y_1 \ldots y_N$

Find substrings $x'$, $y'$ whose similarity (optimal global alignment value) is maximum.

x = aaaacccccggggtta
y = ttcccgggaaccaacc
Why local alignment examples
- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved
Cross-species genome similarity
- 98% of genes are conserved between any two mammals
- >70% average similarity in protein sequence

hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC-------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC-------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC-------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

atoh enhancer in human, mouse, rat, fugu fish
The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment:
   $F_{OPT} = \max_{i,j} F(i,j)$
   Find $F_{OPT}$ and trace back.
2. If we want all local alignments scoring > t:
   For all i, j, find $F(i,j) > t$ and trace back.
   This is complicated by overlapping local alignments (Waterman-Eggert '87: find all non-overlapping local alignments with minimal recalculation of the DP matrix).
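A minimal Smith-Waterman sketch for the first termination rule (best local alignment) is shown below. It relies on the standard differences from the global algorithm: the first row and column are initialized to 0, and every cell takes the maximum of 0 and the three usual cases, so an alignment can start anywhere. The function name `smith_waterman`, the default scores, and the tie-breaking in the traceback are illustrative assumptions.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=2):
    """Best local alignment; returns (F_OPT, gapped x', gapped y')."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                      # start a new local alignment here
                F[i - 1][j - 1] + s,
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:          # F_OPT = max over all i, j
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# The two sequences from the local vs. global comparison, in upper case
print(smith_waterman(
    "TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC",
    "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC",
))
```

For the second termination rule, one would instead collect every cell with F(i, j) > t and trace back from each, which is where the overlap problem mentioned above arises.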
Basic Local Alignment Search Tool
BLAST Algorithm
Basic BLAST Algorithms