Bio Medical Tics - Sequence Analysis - Alignment - 2011

The document discusses sequence alignment methods. It defines sequence alignment as comparing two or more sequences to find patterns that are in the same order. It describes global and local alignments, and how dot plots and dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman are used to compute optimal alignments with scoring matrices and gap penalties.

Uploaded by

黃柏翰


Sequence Analysis

1 DNA SEQUENCING & SEQUENCE ALIGNMENT

Sequence Alignment

Study Goal

- What is a sequence alignment?
- The difference between a global and a local alignment, and what each is used for
- How to use dot matrix methods to analyze genes and chromosomes
- The steps performed by the Needleman-Wunsch and Smith-Waterman algorithms to produce a sequence alignment
- How to use scoring matrix values and gap penalties to produce a sequence alignment

Sequence Alignment

Definition: the comparison of two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

Sequence alignment score: the sum of the individual log odds scores for each pair of aligned sequence characters in an alignment, less a penalty for each gap of one or more positions.

Pair-wise Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings

x = x1 x2 ... xM,
y = y1 y2 ... yN,

an alignment is an assignment of gaps to positions 0, ..., M in x and 0, ..., N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

What is a good alignment?

AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps
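The three tallies above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper name `count_columns` is hypothetical, and each top row is assumed to end in a trailing gap so that both rows have equal length.

```python
def count_columns(row_x, row_y, gap="-"):
    """Count (matches, mismatches, gaps) over the columns of a pairwise alignment."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1          # a letter facing a gap
        elif a == b:
            matches += 1       # identical letters
        else:
            mismatches += 1    # different letters
    return matches, mismatches, gaps


# The three candidate alignments of AGGCTAGTT and AGCGAAGTTT
# (trailing gap on the top row is an assumption, see above).
candidates = [
    ("AGGCTAGTT-", "AGCGAAGTTT"),
    ("AGGCTA-GTT-", "AG-CGAAGTTT"),
    ("AGGC-TA-GTT-", "AG-CG-AAGTTT"),
]
for row_x, row_y in candidates:
    print(row_x, row_y, count_columns(row_x, row_y))
```

Which of the three is "best" cannot be decided from the counts alone; it depends on how matches, mismatches and gaps are weighted.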

Evolution at the DNA level

Sequence edits: deletion, mutation

ACGGTGCAGTTACCA
AC----CAGTCCACCA

Rearrangements: inversion, translocation, duplication

Evolutionary

[Figure: next generation, labeled OK, OK, OK, X, X. Still OK?]

Scoring Function

Sequence edits:

AGGCCTC
Mutations: AGGACTC
Insertions: AGGGCCTC
Deletions: AGG.CTC

Alternative definition: minimal edit distance. Given two strings x, y, find the minimum number of edits (insertions, deletions, mutations) needed to transform one string into the other.

Scoring function:

Match: +m
Mismatch: -s
Gap: -d

Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
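To see how the choice of m, s and d changes which alignment wins, here is a minimal sketch. The function name `alignment_score` and the parameter values are illustrative assumptions, not values given in the slides. It applies the score F to the three candidate alignments of AGGCTAGTT and AGCGAAGTTT.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=2, gap="-"):
    """Score F = (#matches)*m - (#mismatches)*s - (#gaps)*d for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            score -= d   # gap column
        elif a == b:
            score += m   # match
        else:
            score -= s   # mismatch
    return score


candidates = [
    ("AGGCTAGTT-", "AGCGAAGTTT"),      # 6 matches, 3 mismatches, 1 gap
    ("AGGCTA-GTT-", "AG-CGAAGTTT"),    # 7 matches, 1 mismatch, 3 gaps
    ("AGGC-TA-GTT-", "AG-CG-AAGTTT"),  # 7 matches, 0 mismatches, 5 gaps
]

# Expensive gaps (assumed m=1, s=1, d=2): the first alignment scores highest.
print([alignment_score(x, y, m=1, s=1, d=2) for x, y in candidates])  # [1, 0, -3]

# Expensive mismatches (assumed m=1, s=3, d=1): the third alignment scores highest.
print([alignment_score(x, y, m=1, s=3, d=1) for x, y in candidates])  # [-4, 1, 2]
```

The same three alignments are ranked in opposite order under the two parameter sets, which is why the scoring function has to be fixed before asking for "the best" alignment.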

Pair-wise Alignment

- The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
- There are lots of possible alignments.
- Two sequences can always be aligned.
- Sequence alignments have to be scored.
- Often there is more than one solution with the same score.

How do we compute the best alignment?

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments: >> 2^N
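The growth in the number of alignments can be made concrete with a small count. This is a sketch under one stated assumption: two alignments are considered different whenever their column sequences differ, so a gap in x followed by a gap in y counts separately from the reverse order. The function name `count_alignments` is hypothetical.

```python
def count_alignments(len_x, len_y):
    """Number of distinct column sequences aligning strings of the given lengths.

    Every alignment ends in one of three column types (letter/letter,
    letter/gap, gap/letter), so the count obeys
    C(i, j) = C(i-1, j-1) + C(i-1, j) + C(i, j-1), with C(i, 0) = C(0, j) = 1.
    """
    counts = [[1] * (len_y + 1) for _ in range(len_x + 1)]
    for i in range(1, len_x + 1):
        for j in range(1, len_y + 1):
            counts[i][j] = counts[i - 1][j - 1] + counts[i - 1][j] + counts[i][j - 1]
    return counts[len_x][len_y]


for n in (5, 10, 20, 40):
    print(n, count_alignments(n, n), 2 ** n)
```

Even for two sequences of length 10 the count is already far above 2^10, so enumerating every alignment and scoring it is not practical.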

Local alignment vs. Global Alignment

Global alignment: used to align the entire sequences; best for sequences of about the same length.

Local alignment: used to align similar regions along a certain length; best for sequences sharing a conserved region or domain.

What do we want alignment for?

- Orthologous genes
- Xenologous (transferred) genes
- Matches to similar sequences

Sequence conservation, structure conservation, function conservation

What is conserved in a gene [protein] family is functionally important.

- Conservation is due to purifying selection driven by functional constraints, observable against a background described by the theory of neutral evolution.
- Neutral change is fast enough that pseudogenes rapidly deteriorate over evolutionary timescales.
- In any prokaryotic genome, homologs from more than one distantly related species are detectable for 70-80% of proteins.

Application: comparison of sequences/structures can identify homologous relationships, allowing inference of function based on that relationship.

Methods of Alignment

- By hand: slide sequences on two lines of a word processor
- Dot plot, with windows
- Rigorous mathematical approach: dynamic programming (slow, optimal)
- Heuristic methods (fast, approximate): BLAST and FASTA, word matching and hash tables

Align by Hand

GATCGCCTA_TTACGTCCTGGAC
<--->
AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment.

Dotplot

A dotplot gives an overview of all possible alignments.

[Figure: dot matrix with Sequence 1 (T A C A T T A C G T A C) on one axis and Sequence 2 (A T A C A C T T A) on the other]

In a dotplot each diagonal corresponds to a possible (ungapped) alignment.

One possible alignment:

T A C A T T A C G T A C
A T A C A C T T A
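A dot matrix of this kind is simple to compute. The sketch below is illustrative only: the function names are hypothetical, and the two sequences are assumed to be TACATTACGTAC and ATACACTTA as read from the slide. It prints a dot wherever two letters are identical and counts the matches on a chosen diagonal, that is, for one ungapped alignment.

```python
def dot_matrix(seq1, seq2):
    """Rows of '*' and ' ': row j, column i is '*' when seq1[i] == seq2[j]."""
    return ["".join("*" if a == b else " " for a in seq1) for b in seq2]


def diagonal_matches(seq1, seq2, offset):
    """Matches on the diagonal where seq2[j] faces seq1[j + offset]."""
    return sum(
        1
        for j, b in enumerate(seq2)
        if 0 <= j + offset < len(seq1) and seq1[j + offset] == b
    )


seq1 = "TACATTACGTAC"  # plotted along the columns
seq2 = "ATACACTTA"     # plotted along the rows

print("  " + seq1)
for letter, row in zip(seq2, dot_matrix(seq1, seq2)):
    print(letter + " " + row)

# Score every diagonal; each one is a possible ungapped alignment.
for offset in range(-len(seq2) + 1, len(seq1)):
    print(offset, diagonal_matches(seq1, seq2, offset))
```

Long runs of dots along one diagonal correspond to ungapped alignments with many matches; a shift from one diagonal to a neighbouring one indicates an insertion or deletion.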

Insertions / Deletions in a Dotplot

[Figure: dot matrix of T A C T G T C A T against T A C T G T T C A T, which differ by one inserted T]

Dotplot
(Window = 13022Stringency = 9) /

Hemoglobin -chain

Word Size Algorithm

Word Size = 3

[Figure: the sequence T A C G G T A T G A C A G T A T C shown four times; dot matrix of C T A T G A C A against T A C G G T A T G]
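The word-size method can be sketched as follows. This is an assumption-laden illustration rather than the exact procedure from the slides: the function name `word_dots` is hypothetical, and a dot is placed at (i, j) only when the word of length 3 starting at position i of one sequence is identical to the word starting at position j of the other. A dictionary serves as the hash table of words.

```python
def word_dots(seq1, seq2, word_size=3):
    """Positions (i, j) where seq1[i:i+word_size] == seq2[j:j+word_size]."""
    # Hash table: every word of seq2 -> list of its start positions.
    index = {}
    for j in range(len(seq2) - word_size + 1):
        index.setdefault(seq2[j:j + word_size], []).append(j)
    # Look up every word of seq1 in the table.
    dots = []
    for i in range(len(seq1) - word_size + 1):
        for j in index.get(seq1[i:i + word_size], []):
            dots.append((i, j))
    return dots


# The two axis sequences from the slide figure.
print(word_dots("TACGGTATG", "CTATGACA", word_size=3))  # [(5, 1), (6, 2)]
```

Only the shared words TAT and ATG produce dots here, so isolated single-letter matches disappear from the plot. Exact word matching tolerates no mismatch inside a word.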

Window / Stringency

[Figure: three window positions over the pair
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
with Score = 11, Score = 11 and Score = 7]

Typical window size for DNA sequences is 15 bases, and the stringency (match requirement) in this window is 10: if there are 10 to 15 matches within the window, a dot is printed at the first base of the window.

Scoring Matrix Filtering

Matrix: PAM250, Window = 12, Stringency = 9
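A minimal sketch of the window/stringency rule follows. It counts identical letters only, which is a simplification: the slide's protein example is scored with PAM250, and that matrix is not reproduced here. The function name `window_dots` is hypothetical.

```python
def window_dots(seq1, seq2, window=15, stringency=10):
    """Dots at (i, j) when the windows starting at i and j share >= stringency identities."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(
                a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if matches >= stringency:
                dots.append((i, j))  # dot at the first position of the window
    return dots


# Protein pair from the slide, compared by identity only (assumption).
p1 = "PTHPLASKTQILPEDLASEDLTI"
p2 = "PTHPLAGERAIGLARLAEEDFGM"
print(window_dots(p1, p2, window=12, stringency=7))
```

Unlike exact word matching, a window can contain mismatches and still produce a dot, which is what makes this filter more tolerant of diverged sequences.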

Dotplot
(Window = 18 /25 Stringency = 10) Hemoglobin -chain

Hemoglobin -chain

Considerations

- The window/stringency method is more sensitive than the word-size method (ambiguities are permitted).
- The smaller the window, the larger the weight of statistical (unspecific) matches.
- With large windows the sensitivity for short sequences is reduced.
- Insertions/deletions are not treated explicitly.

Dot Matrix

- Unless two sequences are known to be very much alike, the dot matrix method should be considered as a first choice for pair-wise sequence alignment.
- It readily reveals the presence of insertions/deletions and of direct and inverted repeats.
- DNA sequence dot matrix comparison: long windows and high stringencies (7/11, 11/15).
- Protein sequences: short windows and stringencies (1/1), except for short domains of partial similarity in otherwise dissimilar sequences (15/5).

Programs:

- DOTTER: http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
- DOTPLOT: http://helix.nih.gov/docs/gcg/dotplot.html
- PLALIGN in FASTA
- EMBOSS: http://emboss.sourceforge.net/download/

Alignment is additive

Observation: the score of aligning x1...xM with y1...yN is additive.

Say that x1...xi aligns to y1...yj, and xi+1...xM aligns to yj+1...yN:

x1...xi   xi+1...xM
y1...yj   yj+1...yN

The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
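Additivity can be checked directly on an example. The sketch below assumes a simple column score (match +m, mismatch -s, gap -d, with illustrative values) and a hypothetical helper `column_score`. It cuts one alignment at every possible column and confirms that the left and right scores sum to the total.

```python
def column_score(row_x, row_y, m=1, s=1, d=2, gap="-"):
    """Sum of per-column scores: +m for a match, -s for a mismatch, -d for a gap."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            total -= d
        elif a == b:
            total += m
        else:
            total -= s
    return total


row_x = "AGGCTA-GTT-"
row_y = "AG-CGAAGTTT"
whole = column_score(row_x, row_y)

# Cut the alignment after every column: the two parts always add up to the whole.
for cut in range(len(row_x) + 1):
    left = column_score(row_x[:cut], row_y[:cut])
    right = column_score(row_x[cut:], row_y[cut:])
    assert left + right == whole
    print(cut, left, right, whole)
```

This holds because each column is scored independently of its neighbours. With gap penalties that depend on gap length (for example affine gaps), a cut inside a gap would need extra care.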

Basic principles of dynamic programming

- Creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking (evaluation of the optimal path)

It is applicable when a large search space can be structured into a succession of stages, such that the initial stage contains trivial solutions to sub-problems, each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage, and the final stage contains the overall solution. (Mount)

Creation of an alignment path matrix

Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.

- Construct a matrix F indexed by i and j (one index for each sequence).
- F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y.
- Build F(i,j) recursively, beginning with F(0,0) = 0.

If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j). There are three possibilities:

- xi and yj are aligned: F(i,j) = F(i-1,j-1) + s(xi,yj)
- xi is aligned to a gap: F(i,j) = F(i-1,j) - d
- yj is aligned to a gap: F(i,j) = F(i,j-1) - d

The best score up to (i,j) is the largest of the three options.
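The recurrence above can be turned into a short global-alignment routine in the style of Needleman-Wunsch. This is a sketch under stated assumptions, not a reference implementation. The substitution score s(xi,yj) is taken to be +m for identical letters and -s otherwise. The first row and column are filled with F(i,0) = -i·d and F(0,j) = -j·d, because only the gap option exists there. The parameter values are illustrative.

```python
def needleman_wunsch(x, y, m=1, s=1, d=2):
    """Return (best score, aligned x, aligned y) for a global alignment."""
    M, N = len(x), len(y)

    def sub(i, j):
        # Assumed substitution score: +m for identical letters, -s otherwise.
        return m if x[i - 1] == y[j - 1] else -s

    # Creation of the alignment path matrix, starting from F(0,0) = 0.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d          # x1..xi aligned to gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * d          # y1..yj aligned to gaps only

    # Stepwise calculation of score values.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(i, j),  # xi aligned to yj
                F[i - 1][j] - d,              # xi aligned to a gap
                F[i][j - 1] - d,              # yj aligned to a gap
            )

    # Backtracking from (M, N) to (0, 0) along one optimal path.
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


score, ax, ay = needleman_wunsch("AGGCTAGTT", "AGCGAAGTTT")
print(score)
print(ax)
print(ay)
```

When several paths tie, the backtracking returns only one of them (it prefers the diagonal step), which matches the earlier remark that there is often more than one solution with the same score.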

Dynamic Programming

- There are only a polynomial number of subproblems: align x1...xi to y1...yj.
- The original problem is one of the subproblems: align x1...xM to y1...yN.
- Each subproblem is easily solved from smaller subproblems.

Then, we can apply Dynamic Programming!

Let F(i,j) = optimal score of aligning x1...xi with y1...yj
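The claim about the number of subproblems can be illustrated with a top-down version that memoizes F(i,j). This is a sketch with assumed scoring values (match +m, mismatch -s, gap -d) and a hypothetical function name. It returns the optimal score together with the number of distinct subproblems that were actually solved.

```python
from functools import lru_cache


def optimal_score(x, y, m=1, s=1, d=2):
    """Return (optimal global alignment score, number of distinct subproblems solved)."""

    @lru_cache(maxsize=None)
    def F(i, j):
        # F(i, j): optimal score of aligning x1..xi with y1..yj.
        if i == 0:
            return -j * d  # only gaps remain
        if j == 0:
            return -i * d
        substitution = m if x[i - 1] == y[j - 1] else -s
        return max(
            F(i - 1, j - 1) + substitution,  # xi aligned to yj
            F(i - 1, j) - d,                 # xi aligned to a gap
            F(i, j - 1) - d,                 # yj aligned to a gap
        )

    best = F(len(x), len(y))
    return best, F.cache_info().currsize


best, solved = optimal_score("AGGCTAGTT", "AGCGAAGTTT")
print(best, solved)  # solved equals (M + 1) * (N + 1) = 10 * 11
```

Each of the (M+1)(N+1) subproblems is solved once and then reused, so the work grows with the product of the sequence lengths rather than with the number of alignments. The recursion depth is about M + N, so for long sequences the bottom-up matrix fill is the safer choice in Python.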