CE6068 Lecture 5

This document discusses multiple sequence alignment. It defines multiple sequence alignment as aligning a set of DNA/protein sequences by inserting gaps to produce equal-length aligned sequences. The document introduces the concept of maximizing similarity or minimizing distance in multiple sequence alignments. It also defines the sum-of-pairs scoring method.

Uploaded by

林采玟

CE6068

Bioinformatics and Computational Molecular


Advanced Sequence Alignment
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/4/10
Outline

• Quick Review
• Multiple Sequence Alignment
• Genome Alignment
• Database Search

Quick Review
20240327 Exercise #7
• In the context of sequence alignment, what does a gap represent?
(A) A similarity between two sequences (B) The end of a sequence
(C) An insertion or deletion in one of the sequences
(D) A mismatch between bases of two sequences

What Does a Gap Represent
• A gap in sequence alignment represents a space introduced into one or both sequences
to align them more effectively. The introduction of gaps allows for the alignment
algorithm to accommodate insertions or deletions (indels) that have occurred in one
sequence relative to the other. This is crucial for accurately reflecting evolutionary
changes and for identifying regions of similarity despite these genetic variations.
• The concept of a gap is fundamental in both global and local alignment strategies.
The scoring system assigns each gap a penalty, so the algorithm introduces a gap only
where doing so raises the overall similarity score more than accepting mismatches would.
Importance of Understanding Gaps
• Gaps account for evolutionary events such as insertions and deletions, which can have
significant implications for understanding genetic relationships, evolutionary history,
and functional similarities among sequences.
• Properly penalizing gaps in the scoring system of an alignment algorithm is essential
for balancing the trade-off between aligning similar regions and introducing too many
gaps, which could distort the biological significance of the alignment.

Insertion
• Definition: An insertion refers to the addition of one or more nucleotides (in DNA/RNA) or
amino acids (in proteins) into a sequence that are not present in a compared sequence. In the
context of alignment, it is viewed relative to another sequence; what is an insertion in one
sequence may be considered the “original” sequence in another.
• In Alignment: To align sequences when an insertion is present, a gap is introduced in the
opposite sequence at the corresponding position. This allows the rest of the sequences to be
aligned and compared accurately.
• Biological Significance: Insertions can significantly affect gene function and protein structure.
They may result in frameshift mutations if occurring in coding regions, altering the
downstream amino acid sequence and potentially leading to nonfunctional proteins.
Deletion
• Definition: A deletion is the removal of one or more nucleotides or amino acids from a
sequence compared to another. Similar to insertions, deletions are defined relative to another
sequence.
• In Alignment: When a sequence has a deletion relative to its comparison sequence, gaps are
introduced into the alignment of the sequence lacking the nucleotides/amino acids to maintain
alignment integrity.
• Biological Significance: Deletions, like insertions, can have profound effects on gene function
and protein structure. Deletions in coding regions can lead to frameshift mutations, changing
the entire amino acid sequence downstream of the deletion.
How to Identify Insertions and Deletions
• Insertion: When a gap in one sequence aligns with a nucleotide/amino acid in the
other sequence, it suggests an insertion relative to the sequence with the gap.
• Deletion: Conversely, a gap aligning with a nucleotide/amino acid indicates a deletion
in the sequence where the gap is placed, relative to the other sequence.

S1: ACTG-AC
S2: A-CGTGC
‣ The gap in S1 aligns with “T” in S2, suggesting “T” is inserted in S2 relative to S1.
‣ The gap in S2 aligns with “C” in S1, indicating “C” has been deleted from S2 relative to S1.
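The reading rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_indels` and the choice to describe every event relative to S1 are assumptions made here, not part of the lecture.

```python
def classify_indels(s1, s2):
    """List the gapped columns of a pairwise alignment, described relative to S1.

    s1 and s2 are the two aligned rows (equal length, '-' marks a gap).
    Columns are numbered from 1.
    """
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    events = []
    for col, (a, b) in enumerate(zip(s1, s2), start=1):
        if a == "-" and b != "-":
            # Gap in S1 opposite a character of S2: insertion in S2.
            events.append((col, f"'{b}' inserted in S2 relative to S1"))
        elif b == "-" and a != "-":
            # Gap in S2 opposite a character of S1: deletion from S2.
            events.append((col, f"'{a}' deleted from S2 relative to S1"))
    return events


for column, event in classify_indels("ACTG-AC", "A-CGTGC"):
    print(column, event)
```

For the alignment on this slide, the sketch reports the deleted “C” at column 2 and the inserted “T” at column 5.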
20240327 Exercise #13
• Which sequencing technology is used for analyzing the entire genome of an organism
to discover genetic variations?
(A) Whole-Genome Sequencing (B) Sanger Sequencing
(C) RNA Sequencing (D) Metagenomics
‣ While Sanger sequencing has been instrumental in the development of molecular
genetics and remains valuable for specific applications, it does not offer the scalability,
throughput, or cost-efficiency required for whole-genome analysis to discover genetic
variations.
‣ Whole-Genome Sequencing, leveraging NGS technologies, is specifically designed
for this purpose, making it the appropriate answer.
20240327 Exercise #18
• Which technology is most suitable for identifying new genetic variations within a
genome?
(A) DNA Microarrays (B) Sanger Sequencing
(C) Next-Generation Sequencing (NGS) (D) Polymerase Chain Reaction (PCR)
‣ Sanger Sequencing has been instrumental in the field of genetics for decades and was
used in the Human Genome Project. It's accurate and reliable for sequencing short
DNA fragments (up to 900 base pairs).
‣ However, it's relatively low throughput and more time-consuming and expensive per
base sequenced compared to NGS. This makes it less suitable for screening entire
genomes for new genetic variations across numerous samples.
Sanger Sequencing
• Sanger sequencing is highly accurate and has been the gold standard for sequencing individual DNA
fragments. It's commonly used for sequencing small genomes, confirming the accuracy of sequencing results
obtained by other methods, or reading specific genome regions.
• The main limitation of Sanger sequencing is its low throughput compared to next-generation sequencing
(NGS) technologies. Sanger sequencing is not designed for the high-throughput analysis required to
sequence entire genomes, especially large and complex ones, in a cost-effective and time-efficient manner.
Analyzing an entire genome to discover genetic variations requires sequencing millions of base pairs, which
would be impractical and exceedingly time-consuming with Sanger sequencing.
• Sanger sequencing is significantly more expensive than NGS technologies for whole-genome analysis on a
per-base or per-genome basis. The cost-effectiveness of NGS makes it the preferred choice for
comprehensive genome-wide studies.
Whole-Genome Sequencing
• Whole-Genome Sequencing (WGS) involves sequencing the entire genomic DNA of
an organism. It utilizes high-throughput sequencing technologies (often categorized
under NGS) that can sequence billions of DNA base pairs in parallel, making it
feasible to quickly and accurately capture the complete genomic landscape.
• WGS is specifically designed to detect genetic variation across the genome, including
SNPs, indels, structural variants, and copy number variations. It provides a
comprehensive view of an organism's genetic makeup, allowing the identification of
variations that may contribute to traits, disease, and evolutionary history.
20240327 Exercise #19
• In a Smith-Waterman local alignment, what condition prompts the algorithm to cease
the traceback process?
(A) Reaching a cell with a non-negative score
(B) Encountering a gap penalty
(C) Reaching a cell with a score of zero
(D) Completing the alignment of all characters in the shorter sequence

Smith-Waterman Algorithm
• The Smith-Waterman algorithm is designed for local sequence alignment.
• The termination of the traceback process in the Smith-Waterman algorithm upon
reaching a cell with a score of zero ensures that only the segment of the sequences
contributing to the highest local alignment score is considered, disregarding any
sequence parts that do not contribute to a significant local alignment.
• This characteristic makes the Smith-Waterman algorithm particularly suited for
identifying regions of similarity that might be biologically relevant within larger
sequences or across different sequences.
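To make the stopping rule concrete, here is a minimal Smith-Waterman sketch with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) and the function name are illustrative assumptions; the point to notice is the `while H[i][j] > 0` loop, which ends the traceback at the first cell with score zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets a local alignment start afresh at any cell.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell; stop at the first zero cell.
    i, j = best_pos
    row_a, row_b = [], []
    while H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```

With these assumed scores, the shared segment ACGT is recovered and the flanking mismatched characters are left out of the alignment.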
Multiple Sequence Alignment
Multiple Sequence Alignment
• Consider a set of k DNA/protein sequences $S = \{S_1, S_2, \dots, S_k\}$.
• A multiple alignment of S is a set of k equal-length sequences $\{S_1', S_2', \dots, S_k'\}$,
where $S_i'$ is obtained by inserting gaps into $S_i$.
• The Multiple Sequence Alignment (MSA) problem is to find an alignment which
maximizes a certain similarity function or minimizes a certain distance function.
• One popular similarity function is the Sum-of-Pair (SP) score.

Sum-of-Pair (SP) Score (1/2)
• Consider the multiple alignment $S'$ of $S$.
• $\text{SP-score}(a_1, \dots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j)$,
where each $a_i$ can be any character or a space.
• The SP-score of $S'$ is
$\sum_x \text{SP-score}(S_1'[x], \dots, S_k'[x])$.

Sum-of-Pair (SP) Score (2/2)
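• Assume the match score is 2 and the mismatch and indel (character vs. space) scores are -2; a space aligned with a space scores 0.
• S1 = ACG--GAGA
• S2 = -CGTTGACA
• S3 = AC-T-GA-A
• S4 = CCGTTCAC-
• For position 1, SP-score(A,-,A,C) = 2δ(A,-) + 2δ(A,C) + δ(A,A) + δ(C,-) = -8
• For position 8, SP-score(G,C,-,C) = 2δ(G,C) + δ(G,-) + 2δ(C,-) + δ(C,C) = -8
• SP-score = -8+12+0+0-6+0+12-8+0 = 2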
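The column-by-column computation can be checked with a short Python sketch. The scoring used here is an assumption inferred from the position 1 calculation above: identical characters score 2, a mismatch or a character against a space scores -2, and two spaces score 0. The helper names are illustrative.

```python
from itertools import combinations


def delta(a, b, match=2, mismatch=-2, gap=-2):
    """Pairwise score of two alignment symbols ('-' is a space)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """SP-score of one column: sum of delta over all unordered pairs of rows."""
    return sum(delta(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """SP-score of a multiple alignment given as equal-length rows."""
    return sum(sp_column(col) for col in zip(*rows))


alignment = ["ACG--GAGA", "-CGTTGACA", "AC-T-GA-A", "CCGTTCAC-"]
print([sp_column(col) for col in zip(*alignment)])
print(sp_score(alignment))
```

Here `sp_column("A-AC")` evaluates to -8, agreeing with the position 1 calculation, and the printed list gives every column score so each term of the sum can be checked by hand.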

Sum-of-Pair (SP) Distance
• Sum-of-Pair (SP) distance is a distance function.
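• $\text{SP-dist}(a_1, \dots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j)$,
where each $a_i$ can be any character or a space, and $\delta$ here measures the distance between two symbols.
• The SP-dist of $S'$ is
$\sum_x \text{SP-dist}(S_1'[x], \dots, S_k'[x])$.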
• The problem of finding a multiple alignment which maximizes the SP score (or
minimizes the SP distance) is known to be NP-hard.
• It is unlikely to have a polynomial time solution for this problem.
Methods for Solving the MSA Problem
• Multiple sequence alignment methods can be classified as four types:
1) global optimization methods: the basic approach is dynamic programming
2) approximation algorithms
3) heuristic methods
4) probabilistic methods: assume certain model and try to learn the model parameters

Dynamic Programming Method (1/3)
• Let $V(i_1, i_2, \dots, i_k)$ = the SP-score of the optimal alignment of $S_1[1..i_1], S_2[1..i_2], \dots, S_k[1..i_k]$.
• Observation: In the last column of the optimal alignment, row j holds either $S_j[i_j]$ or ‘-’.
• Hence, the score for the last column should be $\text{SP-score}(S_1[b_1 i_1], S_2[b_2 i_2], \dots, S_k[b_k i_k])$
for some $(b_1, b_2, \dots, b_k) \in \{0,1\}^k$ that is not all zero.
(Assume that $S_j[0]$ = ‘-’.)

Dynamic Programming Method (2/3)
• Let $V(i_1, i_2, \dots, i_k)$ = the SP-score of the optimal alignment of $S_1[1..i_1], S_2[1..i_2], \dots, S_k[1..i_k]$.
• Based on the observation, we have
$V(i_1, \dots, i_k) = \max_{(b_1, \dots, b_k) \in \{0,1\}^k \setminus \{(0, \dots, 0)\}} \{ V(i_1 - b_1, \dots, i_k - b_k) + \text{SP-score}(S_1[b_1 i_1], \dots, S_k[b_k i_k]) \}$
• The SP-score of the optimal multiple alignment of $S = \{S_1, S_2, \dots, S_k\}$ is
$V(n_1, n_2, \dots, n_k)$, where $n_i$ is the length of $S_i$.

Dynamic Programming Method (3/3)
• By filling in the dynamic programming table, we compute $V(n_1, n_2, \dots, n_k)$.
• By back-tracing, we recover the multiple alignment.
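The table filling and back-tracing can be sketched for arbitrary k as follows. This is a didactic sketch, not an efficient implementation: it stores the table in a dictionary, assumes the scoring of the SP-score example (match 2, mismatch or indel -2, space against space 0), and is only practical for a handful of short sequences. All names are illustrative.

```python
from itertools import combinations, product


def delta(a, b, match=2, mismatch=-2, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    return sum(delta(a, b) for a, b in combinations(column, 2))


def msa_dp(seqs):
    """Exact SP-optimal alignment by k-dimensional dynamic programming."""
    k = len(seqs)
    lengths = [len(s) for s in seqs]
    origin = (0,) * k
    # Every non-zero vector b in {0,1}^k is a possible last column.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    V, back = {origin: 0}, {}
    # product() yields indices in lexicographic order, so every predecessor
    # idx - b has already been filled in when idx is reached.
    for idx in product(*(range(n + 1) for n in lengths)):
        if idx == origin:
            continue
        best = None
        for b in moves:
            prev = tuple(i - bi for i, bi in zip(idx, b))
            if min(prev) < 0:
                continue
            col = tuple(seqs[j][idx[j] - 1] if b[j] else "-" for j in range(k))
            score = V[prev] + sp_column(col)
            if best is None or score > best:
                best, back[idx] = score, (prev, col)
        V[idx] = best
    # Back-trace from V(n1, ..., nk) to the origin, collecting columns.
    columns, idx = [], tuple(lengths)
    while idx != origin:
        idx, col = back[idx]
        columns.append(col)
    columns.reverse()
    rows = ["".join(col[j] for col in columns) for j in range(k)]
    return V[tuple(lengths)], rows


score, rows = msa_dp(["AGT", "AGCT", "AAT"])
print(score)
print("\n".join(rows))
```

The loop over `moves` is where the $2^k$ factor comes from, and `sp_column` contributes the $k^2$ factor per move.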

Dynamic Programming Method – Complexity

• Time:
‐ The table V has $n_1 n_2 \cdots n_k$ entries.
‐ Filling in one entry takes $O(2^k k^2)$ time.
‐ Total running time is $O(2^k k^2 \, n_1 n_2 \cdots n_k)$.
• Space:
‐ $O(n_1 n_2 \cdots n_k)$ space to store the table V.
• Dynamic programming is expensive in both time and space. It is rarely used for
aligning more than 3 or 4 sequences.
Center Star Method
• The Center Star Method is a heuristic approach for solving the Multiple Sequence
Alignment (MSA) problem.
• It is particularly useful when aligning a small number of sequences or when a quick,
approximate solution is needed.
• This method is built on the idea of selecting one sequence as the “center” and aligning
all other sequences to this center sequence.
• The alignment of these sequences to the center forms the basis of the multiple
sequence alignment.
Steps in the Center Star Method (1/4)
1. Pairwise Alignments and Scoring:
• Initially, perform pairwise alignments for all pairs of sequences.
• Use a scoring scheme to calculate the score for each pairwise alignment. This
scheme typically involves rewards for matches and penalties for mismatches and
gaps.
• The total score of aligning a sequence to all other sequences is summed up.

Steps in the Center Star Method (2/4)
2. Selecting the Center Sequence
• The sequence with the highest total score (i.e., the sequence that, when aligned
pairwise with all other sequences, yields the highest sum of scores) is chosen as the
center sequence.
• The rationale behind selecting the highest-scoring sequence as the center is that it is
likely to be most similar to all other sequences, minimizing the overall alignment
score penalties.

Steps in the Center Star Method (3/4)
3. Constructing the Multiple Sequence Alignment:
• Align each of the non-center sequences to the center sequence.
• Introduce gaps in both the center sequence and the aligned sequences as necessary
to maximize the alignment scores based on the scoring scheme used.
• It is important to note that, while the original center sequence may not have had
gaps, gaps may be introduced into its copies in the final MSA to align with the
other sequences.

Steps in the Center Star Method (4/4)
4. Finalizing the MSA:
• The final MSA is constructed by combining the individual alignments of each
sequence to the center sequence.
• Adjustments are made to ensure that gaps introduced in any alignment are reflected
across all sequences, maintaining the alignment's integrity.
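The four steps can be sketched as follows. This is a minimal illustration under assumed scoring (match 2, mismatch -2, linear gap -2) with a plain Needleman-Wunsch helper; the function names are hypothetical, and the center it picks depends on the scoring, so it need not match the center chosen in any particular worked example.

```python
def nw(a, b, match=2, mismatch=-2, gap=-2):
    """Global pairwise alignment; returns (score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap
    for j in range(1, m + 1):
        V[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s, V[i - 1][j] + gap, V[i][j - 1] + gap)
    i, j, ra, rb = n, m, [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + s:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return V[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def center_star(seqs):
    """Return (index of the center, aligned rows in input order)."""
    k = len(seqs)
    # Steps 1-2: sum pairwise scores and pick the highest-scoring sequence.
    totals = [sum(nw(seqs[i], seqs[j])[0] for j in range(k) if j != i) for i in range(k)]
    c = max(range(k), key=lambda i: totals[i])
    msa, order = [list(seqs[c])], [c]          # msa[0] is the gapped center
    # Steps 3-4: align each sequence to the center and merge column by column.
    for j in range(k):
        if j == c:
            continue
        _, ac, aj = nw(seqs[c], seqs[j])
        merged, new_row, p, q = [[] for _ in msa], [], 0, 0
        while p < len(msa[0]) or q < len(ac):
            mc = msa[0][p] if p < len(msa[0]) else None
            pc = ac[q] if q < len(ac) else None
            if mc is not None and mc == pc:    # same center symbol: keep column
                for r, row in zip(merged, msa):
                    r.append(row[p])
                new_row.append(aj[q]); p += 1; q += 1
            elif pc == "-":                    # new gap in center: pad old rows
                for r in merged:
                    r.append("-")
                new_row.append(aj[q]); q += 1
            else:                              # old gap in center: pad new row
                for r, row in zip(merged, msa):
                    r.append(row[p])
                new_row.append("-"); p += 1
        msa, order = merged + [new_row], order + [j]
    rows = [None] * k
    for idx, row in zip(order, msa):
        rows[idx] = "".join(row)
    return c, rows


center, rows = center_star(["AGT", "AGCT", "AAT"])
print(center)
print("\n".join(rows))
```

The merge loop is the "once a gap, always a gap" rule: a gap that any pairwise alignment places in the center is propagated to every row already in the MSA.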

Center Star Method

Example of Center Star Method (1/2)
• Consider the following sequences to be aligned using the Center Star Method:
Sequence A: AGT
Sequence B: AGCT
Sequence C: AAT
• Assuming Sequence B is selected as the center because it has the highest total alignment score with
Sequences A and C, the next step is to align A and C to B:
Center Sequence B: AGCT
Align A to B: AG-T
Align C to B: A-AT
• The final MSA would reflect adjustments to maintain alignment integrity, potentially introducing gaps
into the center sequence as needed when aligning with other sequences.
Example of Center Star Method (2/2)

Advantages and Limitations of Center Star Method

• Advantages:
‐ Simplicity and speed, making it suitable for initial exploratory alignments or when dealing
with a small number of sequences.
‐ Effectiveness in scenarios where one sequence is indeed central or most representative of
the set.
• Limitations:
‐ The quality of the MSA heavily depends on the choice of the center sequence. A poor
choice can lead to suboptimal alignments.
‐ It does not guarantee an optimal solution, especially as the number of sequences increases.
Progressive Alignment Method (1/2)
• The progressive alignment method was first proposed by Feng and Doolittle (1987).
• It is a heuristic for obtaining a good multiple alignment.
• Basic idea:
‐ Align the two closest sequences.
‐ Progressively align the next most closely related sequences until all sequences are aligned.
• Examples of progressive alignment methods include:
‐ ClustalW, T-Coffee, ProbCons
‐ ProbCons is currently the most accurate MSA algorithm.
‐ ClustalW is the most popular software.
Progressive Alignment Method (2/2)
1. Compute pairwise distance scores for all pairs of sequences.
2. Generate the guide tree, which ensures similar sequences are nearer in the tree.
3. Align the sequences one by one according to the guide tree.
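Steps 1 and 2 can be illustrated with a small sketch. Two simplifications are assumed here and are not taken from the slides: the pairwise distance is plain edit distance, and the guide tree is built by average-linkage clustering (UPGMA-style) rather than by any specific published tree method. The nested tuples returned give the order in which sequences or groups would be aligned in step 3.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Average-linkage guide tree as nested tuples of sequence indices."""
    n = len(seqs)
    dist = {(i, j): edit_distance(seqs[i], seqs[j]) for i, j in combinations(range(n), 2)}
    members = {i: (i,) for i in range(n)}   # cluster id -> sequence indices
    trees = {i: i for i in range(n)}        # cluster id -> subtree

    def avg(x, y):
        pairs = [(min(i, j), max(i, j)) for i in members[x] for j in members[y]]
        return sum(dist[p] for p in pairs) / len(pairs)

    next_id = n
    while len(members) > 1:
        # Merge the two clusters with the smallest average distance.
        x, y = min(combinations(sorted(members), 2), key=lambda p: avg(*p))
        members[next_id] = members.pop(x) + members.pop(y)
        trees[next_id] = (trees.pop(x), trees.pop(y))
        next_id += 1
    return trees[next_id - 1]


print(guide_tree(["ACGT", "ACGA", "TTTT"]))
```

In this toy input the first two sequences differ by one substitution, so they are joined first and the third sequence is attached last.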

ClustalW
• ClustalW is one of the most widely used tools for performing Multiple Sequence
Alignments (MSA).
• It is designed to handle the alignment of multiple sequences by first constructing a
guide tree to represent the evolutionary relationships between sequences and then
aligning the sequences according to this tree.
• ClustalW simplifies the complex task of MSA into manageable steps, making it
accessible to researchers and practitioners across various fields of biology and
bioinformatics.
How ClustalW Works (1/2)
1. Pairwise Alignment Scores:
Initially, ClustalW performs all possible pairwise alignments between the sequences using a
dynamic programming algorithm, similar to the Needleman-Wunsch algorithm for global
alignments. These alignments are not saved but are used to calculate a distance matrix
representing the divergence between each pair of sequences.
2. Guide Tree Construction:
Using the distance matrix, a guide tree is constructed using the Neighbor-Joining method.
This tree represents the evolutionary pathway of the sequences, with closely related
sequences being placed closer together.
How ClustalW Works (2/2)
3. Progressive Alignment:
Starting with the most closely related sequences as indicated by the guide tree, ClustalW
aligns sequences or groups of sequences step by step. This is done by following the branches
of the tree, ensuring that sequences with a common ancestor are aligned before being aligned
to more distant relatives.
4. Adjustments and Refinement:
Although not originally a feature in the earliest versions, later iterations of Clustal and its
successors, like Clustal Omega, have incorporated iterative refinement techniques. These
adjustments help improve the alignment by re-aligning regions that can be optimized further.
Key Features of ClustalW
• User-Friendly: ClustalW is designed to be accessible, requiring minimal input from
the user to perform complex alignments.
• Versatility: It can be used for DNA, RNA, and protein sequences.
• Gap Penalties: It allows for customizable gap penalties, including gap opening and
extension penalties, to fine-tune alignments.
• Weighting: ClustalW can apply weights to sequences in the alignment process to
correct for sampling bias or overrepresentation.

Profile-Profile Alignment
• Profile-Profile alignment is an advanced method used for aligning two sequence
profiles.
• A sequence profile is a statistical representation that captures the variations observed
across multiple alignments of sequences.
• Instead of aligning individual sequences, profile-profile alignment seeks to align these
statistical representations, which can provide a more nuanced understanding of the
evolutionary and functional relationships between groups of sequences.

Concept of Profile-Profile Alignment
• Profile-Profile alignment extends the idea of sequence alignment into the realm of
aligning groups of sequences.
• Each profile is essentially a matrix where rows represent positions in the alignment
and columns correspond to the 20 standard amino acids plus gaps, detailing the
frequency or probability of each amino acid at each position.
• These profiles encapsulate information about conserved regions, variable regions, and
the typical insertions or deletions (indels) that may occur in a group of related
sequences.
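A profile of this kind can be built directly from the columns of an MSA. The sketch below uses a DNA alphabet plus the gap symbol to keep the output short, which is an assumption made for illustration; for proteins the `alphabet` argument would list the 20 amino acids plus ‘-’. The function name is hypothetical.

```python
from collections import Counter


def build_profile(msa, alphabet="ACGT-"):
    """Return one dict per alignment position mapping symbol -> frequency."""
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all rows of an MSA must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append({ch: counts[ch] / len(column) for ch in alphabet})
    return profile


for position, freqs in enumerate(build_profile(["AG-T", "AGCT", "A-AT"]), start=1):
    print(position, freqs)
```

Fully conserved positions show a single symbol with frequency 1.0, while variable positions and positions with indels spread their mass across several symbols, including ‘-’.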
Steps in Profile-Profile Alignment (1/2)
1. Profile Construction:
Initially, multiple sequence alignments (MSAs) are generated for two sets of related
sequences. From these MSAs, profiles are constructed that capture the consensus information,
including the conservation and variability of each position.
2. Scoring System:
A scoring system is then devised to compare profiles. This system accounts for the
probabilities of amino acids at each position in the profiles, incorporating substitutions,
insertions, and deletions. The scoring may include sophisticated statistical models that reflect
evolutionary distances.
Steps in Profile-Profile Alignment (2/2)
3. Alignment Algorithm:
Similar to sequence alignment, dynamic programming can be adapted to perform profile-
profile alignments. However, the computation now involves comparing matrices (profiles)
rather than sequences. The goal is to find an alignment that maximizes the overall alignment
score, which is computed based on the scoring system that compares profiles.
4. Optimization and Iteration:
The alignment process may include optimization steps, such as iterative refinement, where
the initial profile-profile alignment is used to generate a new, combined MSA. From this
MSA, a new profile is constructed, and the process may be repeated to improve the alignment.
Limitation of Progressive Alignment Construction

• Since the progressive method is a heuristic which does not realign the sequences, the
multiple sequence alignment is poor if the initial alignment is poor.
• The method is sensitive to the distribution of the sequences in the sequence set.
• Progressive alignment does not guarantee convergence to the global optimum.

Iterative Methods for MSA (1/2)
• The iterative method for Multiple Sequence Alignment (MSA) seeks to improve upon
the initial alignment iteratively.
• The process begins with generating an initial multiple alignment using a method such
as progressive alignment.
• Once this initial alignment is established, the iterative process aims to refine and
improve this alignment to better reflect the actual sequence relationships.

Iterative Methods for MSA (2/2)
• Iterative methods involve the following basic steps:
‣ Initial Multiple Alignment Generation: An initial alignment is created, possibly
using a simple or heuristic approach.
‣ Iterative Improvement: The initial alignment is then refined through a series of
iterations. In each iteration, the alignment is adjusted in an attempt to improve its
overall quality, typically measured by some score like the sum-of-pairs score.

Examples of Iterative Methods
• PRRP
• MAFFT
• MUSCLE: MUSCLE is an iterative method that optimizes the sum-of-pairs score with an
affine gap penalty. It involves three stages:
1. Draft Progressive: A preliminary multiple alignment is created using a progressive alignment
method.
2. Improved Progressive: The alignment from the first stage is refined using an improved progressive
alignment technique.
3. Refinement: The alignment is further refined to maximize the sum-of-pairs score using tree-
dependent restricted partitioning.
Benefits of the Iterative Approach
• The key advantage of iterative methods over basic progressive alignment is their ability to
revisit and adjust earlier alignment decisions. This iterative refinement often leads to a more
accurate representation of the sequence relationships, especially when initial alignments may
have been suboptimal.
• Iterative methods like MUSCLE use sophisticated scoring functions, such as the log-
expectation score, to better handle alignment of sequences with varying levels of similarity
and complexity. This scoring mechanism considers not just the amino acids themselves but
also the frequency of gaps and the observed frequencies of amino acids in the alignments,
making it more nuanced than simpler scoring schemes.
Genome Alignment
Genome Alignment
• Definition:
Genome alignment involves comparing the entire genome sequences of two or more
organisms to identify regions of similarity and difference.
• Importance:
Essential for studying evolutionary relationships, identifying conserved genes, understanding
genomic variations, and detecting evolutionary events like mutations, duplications, and
rearrangements.
• Challenge:
High computational complexity due to the large size of genomes and the need for efficient
algorithms to manage time and space requirements effectively.
Goals of Genome Alignment
• To identify orthologous genes or regions (genes in different species that originated
from a single gene in the last common ancestor).
• To predict functionally important regions by comparing genomes of closely related
species.
• To understand the mechanisms of genome evolution and the effects of rearrangements
and mutations over time.

Overview of Genome Alignment Process
• Step 1:
Identification of potential anchors or conserved regions through shared k-mers or
other methods.
• Step 2:
Aligning these conserved regions accurately.
• Step 3:
Extending alignments to include non-conserved regions, thus bridging the gaps
between conserved anchors.
Challenges in Genome Alignment
• Computational demands:
Due to the vast amount of data, aligning complete genomes requires significant
computational resources.
• Complexity:
The presence of repetitive sequences, structural variations, and evolutionary
divergences adds to the complexity of alignment.
• Alignment accuracy:
Balancing speed and accuracy to ensure meaningful biological interpretation.
Tools for Genome Alignment
• Key tools include MUMmer, Mutation Sensitive Alignment, SSAHA,
AVID, MGA, BLASTZ, and LAGAN.
• These tools specialize in handling the large-scale data and
complexities of genome alignment.

Maximal Unique Match (1/3)
• Although conserved regions shared by two genomes rarely contain exactly the same
sequence, the conserved regions of two genomes usually share some short common
substrings which are unique in the two genomes.
• Maximal Unique Match (MUM) was proposed to be an anchor for defining candidate
conserved regions.

Maximal Unique Match (2/3)
• Definition:
A Maximal Unique Match (MUM) is a common substring that occurs exactly once
in each of the genomes being compared and cannot be extended in either direction.
• Role in genome alignment:
Acts as an anchor for identifying conserved regions across genomes.
• Importance:
Facilitates efficient comparison by narrowing down the focus to highly probable
conserved areas, reducing computational complexity.
Maximal Unique Match (3/3)
• DEFINITION 4.1
Given two genomes A and B, a Maximal Unique Match (MUM) substring is a common
substring of A and B of length at least a specified minimum length d (by default, d
= 20) such that
‐ it is maximal, that is, it cannot be extended on either end without incurring a
mismatch; and
‐ it is unique in both sequences.

Examples of Finding MUMs
• S = acgactcagctactggtcagctattacttaccgc#
• T = acttctctgctacggtcagctattcacttaccgc$
• There are four MUMs:
ctc, gctac, ggtcagctatt, acttaccgc.
• Consider S=acgat#, T=cgta$
• There are two MUMs: cg and t

How to Find MUMs
• Brute-force approach:
Input: Two genome sequences S[1..m1] and T[1..m2]
For every position i in S
For every position j in T
Find the longest common prefix P of S[i..m1] and T[j..m2]
Check whether |P|≥d and whether P is unique in both genomes. If yes, report it
as a MUM.
• This solution requires at least O(m1m2) time.
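The brute-force loop can be written out as below. Two details are assumptions added for the sketch: the terminators # and $ are replaced by end-of-string checks, and a pair (i, j) is skipped when the preceding characters agree, since the match could then be extended to the left and would not be maximal. Function names are illustrative.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, k) for k in range(len(text) - len(pattern) + 1))


def brute_force_mums(s, t, d=1):
    """Return (i, j, substring) for every MUM of length >= d (0-based starts)."""
    mums = []
    for i in range(len(s)):
        for j in range(len(t)):
            # Left-maximality: the characters just before the match must differ.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Longest common prefix of s[i:] and t[j:] gives right-maximality.
            length = 0
            while i + length < len(s) and j + length < len(t) and s[i + length] == t[j + length]:
                length += 1
            p = s[i:i + length]
            if length >= d and count_occurrences(s, p) == 1 and count_occurrences(t, p) == 1:
                mums.append((i, j, p))
    return mums


print(brute_force_mums("acgat", "cgta", d=1))
```

For S = acgat and T = cgta with d = 1 this reports cg and t. The single character g is rejected because it can be extended leftwards to cg, and a is rejected because it occurs twice in S.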
Finding MUMs by Suffix Tree
• MUMs can be found in O(m1+m2) time by suffix tree!
1. Build a generalized suffix tree for S and T
2. Mark all the internal nodes that have exactly two leaf children, which represent both
suffixes of S and T.
3. For each marked node, WLOG, suppose it represents the i-th suffix Si of S and the j-
th suffix Tj of T. We check whether S[i-1]=T[j-1]. If not, the path label of this
marked node is an MUM.

Example of Finding MUMs
• Consider S=acgat#, T=cgta$
• Step 1: Build the generalized suffix tree for S and T.

Example of Finding MUMs
• Step 2: Mark all the internal nodes that have exactly two leaf children, which
represent both suffixes of S and T.

Example of Finding MUMs
• Step 3: For each marked node, WLOG, suppose it represents the i-th suffix Si of S and
the j-th suffix Tj of T. We check whether S[i-1]=T[j-1]. If not, the path label of this
marked node is an MUM.

Database Search
Biological Databases
• Biological data doubles in size every 15 or 16 months.
• The number of queries is increasing: 40,000 queries per day.
• Therefore, we need efficient searching methods for genomic databases.

Problem Definition
• Given a database D of genomic (or protein) sequences and a query string Q, the goal
is to find the string(s) S in D which is/are the closest match(es) to Q using a given
scoring function.
• The scoring function can be either:
‐ Semi-Global Alignment Score: The best possible alignment score between a substring A of
S and Q.
‐ Local Alignment Score: The best possible alignment score between a substring A of S and
a substring B of Q.
Measures Used to Evaluate the Effectiveness
• Sensitivity
Ability to detect “true positives”.
Sensitivity can be measured as the probability of finding the match given that the query and the
database sequence have only x% similarity.
• Specificity
Ability to reject “false positives”.
Specificity is related to the efficiency of the algorithm.
• A good search algorithm should be both sensitive and specific.
Types of Algorithms
• Exhaustive approach
Smith-Waterman Algorithm
• Heuristic methods
FastA
BLAST and BLAT
PatternHunter
• Filter and refine approaches
LSH
QUASAR
• BWT-SW
Smith-Waterman Algorithm
• Input:
the database D (total length: n) and
the query Q (length: m)
• Output: all closest matches (based on local alignment)
• Algorithm:
For every sequence S in the database,
use the Smith-Waterman algorithm to compute the best local alignment between S and Q.
Return all alignments with the best score.
‣ Time: O(nm)
‣ This is a brute-force algorithm, so it is the most sensitive algorithm.


What is FastA
• Given a database and a query,
FastA performs local alignment against all sequences in the database and returns the good
alignments.
Its assumption is that a good local alignment should contain some exactly matching
substrings.

BLAST
• BLAST = Basic Local Alignment Search Tool
• Input:
A database D of sequences
A sequence s
• Aim of BLAST:
Compare s against all sequences in D in a reasonable time based on heuristics. Faster than
FastA.
• Disadvantage of BLAST:
To be fast, it sacrifices accuracy. Thus, it is less sensitive.
BLAST1
• BLAST1 aims to report all database sequences S in D such that the maximal segment
pair (MSP) score of S and Q is bigger than some threshold α.
• BLAST1 is a heuristic which tries to achieve the aim.
• There are 3 steps in the BLAST1 algorithm:
Step 1: Query preprocessing
Step 2: Scan the database for hits
Step 3: Extension of hits

High Scoring Segment Pair
• A segment pair is called a high scoring segment pair (HSP) if we cannot extend it to
give a higher score. Among all high scoring segment pairs of S1 and S2, the HSP with
the maximum score is called the maximal segment pair (MSP).

Step 1: Query Preprocessing
• For every position p of the query, find the list of w-tuples (length-w strings) scoring
more than a threshold T when paired with the word of the query starting at position p.
The w-tuples in this list are called neighbors.
• For DNA, w = 11 (default).

Step 2: Scan the Database for Hits
• Scan the database DB.
For each position p of the query, if there is an exact match between the neighbors of p
and a w-tuple in DB, a hit is made.
• A hit is characterized by the positions in both query and DB sequences.
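Steps 1 and 2 can be sketched together for the simplest case. The assumption made here is that the neighbor list of each query position contains only the query word itself, so a hit is an exact w-mer match; a fuller version would also add similar words scoring above T. The function name and the small w in the usage line are illustrative.

```python
from collections import defaultdict


def find_hits(query, subject, w=11):
    """Return (query position, subject position) for every shared w-mer."""
    # Step 1: index every length-w word of the query by its start position.
    index = defaultdict(list)
    for p in range(len(query) - w + 1):
        index[query[p:p + w]].append(p)
    # Step 2: scan the database sequence and look each word up in the index.
    hits = []
    for s in range(len(subject) - w + 1):
        for p in index.get(subject[s:s + w], ()):
            hits.append((p, s))
    return hits


print(find_hits("TTACGTACGTGG", "CCACGTACGTAA", w=4))
```

Each hit is a pair of positions, one in the query and one in the database sequence, which is exactly what the extension step needs as its starting point.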

Step 3: Extension of Hits
• For every hit, extend it in both directions, without gaps.
• The extension is stopped as soon as the score decreases by more than X (a parameter of
the program) from the highest value reached so far.
• If the extended segment pair has a score greater than or equal to S (a parameter of the
program), it is called an HSP (high-scoring segment pair) and is reported.
• For every sequence in the database, the best-scoring HSP is called the MSP (maximal
segment pair).
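The ungapped extension with the drop-off rule can be sketched as follows. The match and mismatch scores (+1 and -3) and all names are illustrative assumptions; comparing the returned score against the threshold S is left to the caller.

```python
def _extend(pairs, x_drop, score):
    """Walk along aligned character pairs; return (best gain, length at best)."""
    best = run = 0
    best_len = 0
    for k, (a, b) in enumerate(pairs, start=1):
        run += score(a, b)
        if run > best:
            best, best_len = run, k
        elif best - run > x_drop:
            break  # score fell more than X below the best value seen so far
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, w, x_drop, match=1, mismatch=-3):
    """Extend a length-w hit in both directions without gaps.

    Returns (score, query start, subject start, length) of the segment pair.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(a, b) for a, b in zip(query[q_pos:q_pos + w], subject[s_pos:s_pos + w]))
    right, len_r = _extend(zip(query[q_pos + w:], subject[s_pos + w:]), x_drop, score)
    left, len_l = _extend(zip(reversed(query[:q_pos]), reversed(subject[:s_pos])), x_drop, score)
    return seed + left + right, q_pos - len_l, s_pos - len_l, len_l + w + len_r


print(extend_hit("TTACGTACGTGG", "CCACGTACGTAA", q_pos=4, s_pos=4, w=4, x_drop=5))
```

In this toy case the 4-character hit GTAC grows by two matching characters on each side and then stops, because the following mismatches push the running score more than X below its maximum.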

Statistics for Local Alignment
• A local alignment without gaps consists simply of a pair of equal length segments.
• BLAST and FASTA find the local alignments whose score cannot be improved by extension.
In BLAST, such local alignments are called high-scoring segment pairs or HSPs.
• To determine the significance of the local alignments, BLAST and FASTA show the E-value and
bit score. Below, we give a brief discussion of them.
• Assumption: We require the expected score for aligning a random pair of residues/bases to
be negative.
Otherwise, the longer the alignment, the higher the score, independent of whether the segments
aligned are related or not.
E-value
• The E-value is the expected number of alignments with raw score at least S that arise purely by chance.
• Let m and n be the lengths of the query sequence and the database sequence,
respectively.
• When both m and n are sufficiently long, the scores of HSPs follow the extreme value
distribution (Gumbel distribution).
• The expected number E of HSPs with a score of at least S is $E = K m n e^{-\lambda S}$, for some
parameters K and λ which depend on the scoring matrix and the expected frequencies of the residues/bases.

Bit Score
• The raw score S of an alignment depends on the scoring system.
• Without knowing the scoring system, the raw score is meaningless.
• The bit score is defined to normalize the raw score:
$S' = \frac{\lambda S - \ln K}{\ln 2}$
• Note that $E = m n \, 2^{-S'}$. Hence, when $S'$ is big, the HSP is significant.
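The relation between raw score, bit score, and E-value can be checked numerically. Substituting $S' = (\lambda S - \ln K)/\ln 2$ into $E = K m n e^{-\lambda S}$ gives $E = m n \, 2^{-S'}$, and the sketch below confirms that both routes agree. The values of λ and K in the usage lines are placeholders chosen for illustration, not parameters of any particular scoring matrix.

```python
import math


def bit_score(raw, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


def e_value(raw, m, n, lam, K):
    """Expected number of chance HSPs with score >= raw: K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw)


def e_value_from_bits(bits, m, n):
    """Same quantity expressed through the bit score: m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


lam, K = 0.3, 0.1            # placeholder statistical parameters
m, n, raw = 300, 1_000_000, 60
bits = bit_score(raw, lam, K)
print(bits, e_value(raw, m, n, lam, K), e_value_from_bits(bits, m, n))
```

Because the bit score absorbs λ and K, two alignments scored under different scoring systems can be compared through their bit scores and the search-space size mn alone.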

p-value
• The number of random HSPs with score ≥ S follows a Poisson distribution.
• $\Pr(\text{exactly } x \text{ HSPs with score} \ge S) = e^{-E} E^x / x!$, where $E = K m n e^{-\lambda S}$ is the E-value.
• p-value = Pr(at least 1 HSP with score ≥ S) = $1 - e^{-E}$.
‐ When the E-value increases, the p-value approaches 1.
‐ When E = 3, the p-value is $1 - e^{-3} \approx 0.95$.
‐ When E = 10, the p-value is $1 - e^{-10} \approx 0.99995$.
‐ When E < 0.01, the p-value is $1 - e^{-E} \approx E$.
• In BLAST, the p-value is not shown, since the p-value and the E-value are
approximately the same when E < 0.01, while the p-value is almost 1 when E > 10.
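The conversion from E-value to p-value is a one-liner, sketched here to reproduce the numbers on this slide. The function name is illustrative; `math.expm1` is used so that very small E-values do not lose precision.

```python
import math


def p_value(e):
    """Probability of at least one chance HSP, given the E-value: 1 - exp(-E)."""
    return -math.expm1(-e)


for e in (0.001, 0.01, 3, 10):
    print(e, p_value(e))
```

The printed values show the two regimes described above: for small E the p-value is nearly equal to E, and for large E it saturates close to 1.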
Q&A
Thank you!
