Mining Sequence Patterns in Biological Data
Bioinformatics
Applies computer technology to molecular biology
Develops algorithms and methods to manage
and analyze biological data
Effective methods are needed to compare
and align biological sequences and discover
sequential patterns
Type of data
DNA: double-helix molecule made of two complementary, antiparallel strands of nucleotides: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
Proteins: chains composed of 20 kinds of amino acids
Produced from DNA using three operations (transformations): transcription, splicing, and translation
Gene : Sequence of hundreds of individual
nucleotides arranged in a particular order
Genome : Complete set of genes of an
organism
Alignment of Biological
Sequences
Alignment: given two or more input biological sequences, identify similar sequences with long conserved subsequences
Pair-wise Sequence alignment
Multiple Sequence Alignment
In nucleotide sequences, two symbols align if they are identical
In amino acid sequences, two symbols align if they are identical or if one can be derived from the other
Local alignment vs. global alignment
A substitution matrix represents the probabilities of substitutions
An alignment score can be calculated from the matrix and the gap penalties
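To make the idea of an alignment score concrete, here is a minimal sketch. It assumes a simple match / mismatch / gap scheme rather than a real substitution matrix such as BLOSUM or PAM; the function name and the default score values are illustrative choices, not something given in the slides.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned rows of equal length; '-' marks a gap.

    The match/mismatch/gap values are assumed for illustration. A real
    scorer would look up each residue pair in a substitution matrix.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # one symbol aligned against a gap
        elif a == b:
            score += match        # identical symbols
        else:
            score += mismatch     # substitution
    return score


print(alignment_score("GATTACA", "GA-TGCA"))
```

Replacing the `match` / `mismatch` branch with a dictionary lookup keyed on the residue pair would turn this into a substitution-matrix score.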
Need for alignment
Two sequences are homologous if they
share the same ancestor
Degree of similarity helps to determine
degree of homology
Helps to construct an evolutionary (phylogenetic) tree
Pairwise Alignment
Needleman-Wunsch Algorithm
Smith-Waterman Algorithm
Build up an optimal alignment from optimal alignments of shorter prefixes
Use dynamic programming
O(n^2) time complexity
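The dynamic-programming recurrences behind both algorithms can be sketched as follows. This is a score-only illustration (no traceback) that assumes a linear gap penalty and a simple match / mismatch scheme; the function names and default scores are assumptions made for the example.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of s and t (score only)."""
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap                     # prefix of s against gaps
    for j in range(1, cols):
        F[0][j] = j * gap                     # prefix of t against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[-1][-1]


def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (score only)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]     # first row/column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The extra 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

Both functions fill a table with one cell per pair of positions, which is where the quadratic running time comes from. The only structural difference is the floor at zero and taking the maximum over the whole table in the local version.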
Dot matrix plot
Uses boolean matrices to represent alignments
that can be detected visually
O(n^2) time complexity
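A dot matrix can be sketched in a few lines. The helper names below are hypothetical, and the text rendering is only one possible way to inspect the matrix visually.

```python
def dot_matrix(s, t):
    """Boolean matrix with True where s[i] == t[j]."""
    return [[a == b for b in t] for a in s]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


print(render(dot_matrix("GATTACA", "GATTTACA")))
```

Runs of `*` along a diagonal correspond to matching stretches; a shift from one diagonal to a neighboring one suggests an insertion or deletion.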
Heuristic Algorithms
BLAST: Basic Local Alignment Search Tool
FASTA: FAST-All, a fast alignment tool
Both first locate high-scoring short stretches and then extend them
BLAST Local Alignment
Algorithm
Finds regions of local similarity between biosequences
Matches nucleotide / protein sequences to
sequence databases and calculates statistical
significance of matches
Breaks the sequences to be compared into short fragments (words) and seeks matches between words
DNA word size: 11 bases
Protein word size: 3 amino acids
Creates a hash table of matching words
Moves from exact matches to neighborhood words
Hashing makes word lookup fast, giving roughly O(n) time
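The word-hashing and seed-extension idea can be sketched as below. This is a heavily simplified illustration, not BLAST itself: it uses exact word matches only, extends seeds without gaps while characters are identical, and omits neighborhood words, scoring matrices, drop-off thresholds, and significance statistics. All function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence, word_size):
    """Hash table mapping each word to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs of exact word matches."""
    index = build_word_index(subject, word_size)
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


def extend_seed(query, subject, q_pos, s_pos, word_size):
    """Extend a seed left and right while characters are identical.

    Returns (query_start, subject_start, length) of the ungapped match.
    """
    left = 0
    while (q_pos - left > 0 and s_pos - left > 0
           and query[q_pos - left - 1] == subject[s_pos - left - 1]):
        left += 1
    right = word_size
    while (q_pos + right < len(query) and s_pos + right < len(subject)
           and query[q_pos + right] == subject[s_pos + right]):
        right += 1
    return q_pos - left, s_pos - left, left + right


query, subject = "TTACGTACGG", "CCACGTACGA"
seeds = find_seeds(query, subject, word_size=4)
print(seeds)
print(extend_seed(query, subject, 2, 2, word_size=4))
```

Building the index takes one pass over the subject and each query word is then a constant-time lookup on average, which is the source of the near-linear behavior.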
Variants: MEGABLAST (long alignments), discontiguous MEGABLAST (gapped alignments of similar but not identical sequences), BLASTN (adjustable word size), BLASTP (protein sequences)
Multiple Sequence
Alignment Methods
Goal: to find common patterns among all considered sequences
Applications
Build gene / protein families
Identify amino acids that are essential sites for structure and function
More complex than pairwise alignment
Multi-dimensional alignment / approximate alignment
Methods
Series of pair-wise alignments
Feng-Doolittle alignment
Computes all possible pair wise alignments by
dynamic programming
Constructs a guide tree by clustering
Progressively aligns the sequences along the guide tree to produce the multiple sequence alignment
Hidden Markov Models
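The guide-tree step of a progressive method such as Feng-Doolittle can be sketched as follows. This is an assumption-laden illustration: it takes pairwise distances as given (in practice they would be derived from the pairwise alignment scores) and uses plain average-linkage clustering, which is a swapped-in stand-in for whatever clustering method a real implementation uses. The function name and the distance values are made up for the example.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Build a nested-tuple guide tree by average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    # Each cluster is (subtree, list_of_leaf_names).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two closest clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Made-up distances for four hypothetical sequences.
example = {
    frozenset(("a", "b")): 1, frozenset(("c", "d")): 2,
    frozenset(("a", "c")): 10, frozenset(("a", "d")): 10,
    frozenset(("b", "c")): 10, frozenset(("b", "d")): 10,
}
print(guide_tree(["a", "b", "c", "d"], example))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the closest sequences first and merging alignments at each internal node.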
HMM for Biological
Sequence Analysis
Finding CpG Islands
The methylation process tends to convert the C in CpG to T
As a result, CpG dinucleotides are rare in the genome
Methylation is suppressed around the start regions of genes
Areas with a high concentration of CpG are called CpG islands
Given a short sequence, is it from a CpG island?
Given a long sequence, can all CpG islands be found?
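As a naive illustration of "high concentration of CpG", the sketch below counts CG dinucleotides in fixed windows. It is not the Markov-model approach and it does not decide what counts as an island; the function names, window size, and step are assumptions chosen for the example.

```python
def cpg_count(seq):
    """Number of CG dinucleotides in seq (case-insensitive)."""
    # "CG" cannot overlap itself, so str.count gives the exact number.
    return seq.upper().count("CG")


def cpg_density_windows(seq, window, step):
    """Return (start, density) per window, where density is the fraction
    of dinucleotide positions in the window that are CG."""
    return [
        (start, cpg_count(seq[start:start + window]) / (window - 1))
        for start in range(0, len(seq) - window + 1, step)
    ]


for start, density in cpg_density_windows("CGCGATAT", window=4, step=4):
    print(start, round(density, 2))
```

Windows with unusually high density are candidates for closer inspection, but a fixed threshold on such counts is a crude substitute for a probabilistic model.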
Markov Chain
Probability of a symbol depends only on
previous symbol
Markov chain model: states and transitions (with probabilities)
Probability of a sequence x = x_1 x_2 ... x_L: P(x) = P(x_1) \prod_{i=2}^{L} P(x_i | x_{i-1})
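The probability of a sequence under a first-order Markov chain is the probability of the first symbol times the product of the transition probabilities, which can be sketched as follows. Logarithms are used to avoid underflow on long sequences. The uniform parameters are placeholders chosen only so the example runs; they are not estimates from real data.

```python
import math


def markov_log_prob(seq, start, trans):
    """Log-probability of a non-empty seq under a first-order chain.

    start[s] is P(first symbol = s); trans[s][t] is P(next = t | current = s).
    """
    logp = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp


# Placeholder parameters: every symbol and transition equally likely.
ALPHABET = "ACGT"
uniform_start = {s: 0.25 for s in ALPHABET}
uniform_trans = {s: {t: 0.25 for t in ALPHABET} for s in ALPHABET}

print(markov_log_prob("ACGCG", uniform_start, uniform_trans))
```

Given one chain trained on CpG islands and another on background DNA, the difference of the two log-probabilities is a log-odds score that addresses the "is this short sequence from a CpG island" question.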
Hidden Markov Model
Used to find all CpG islands in a long DNA
Sequence
Merge two Markov chains and add transition
probabilities between the two states
Hidden Markov Model: states, transitions,
emission probabilities (probability of producing a
symbol at a state)
Hidden because the states visited in generating a
sequence are not known
Hidden Markov Models
Tasks
Evaluation: given a sequence x, determine its probability P(x) (Forward algorithm)
Decoding: given a sequence, determine the most probable path through the model (Viterbi algorithm)
Learning: given a model structure and training sequences, find the model parameters that best fit the sequences (Baum-Welch algorithm)
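The evaluation and decoding tasks can be sketched on a toy model. The two-state HMM below ("I" for island, "B" for background) and all of its probabilities are hypothetical values invented for illustration; it is a simplification of the merged-chain CpG model, and it works with raw probabilities, so long sequences would need log-space or scaling. Baum-Welch is not shown.

```python
def forward(obs, states, start, trans, emit):
    """Evaluation: P(obs), summed over all state paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for sym in obs[1:]:
        f = {
            s: emit[s][sym] * sum(f[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(f.values())


def viterbi(obs, states, start, trans, emit):
    """Decoding: most probable state path and its joint probability."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []                                  # back-pointers per step
    for sym in obs[1:]:
        ptr, nxt = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] * trans[r][s])
            ptr[s] = best
            nxt[s] = v[best] * trans[best][s] * emit[s][sym]
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):                 # follow pointers backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


# Hypothetical toy parameters (not estimated from data).
STATES = ("I", "B")
START = {"I": 0.5, "B": 0.5}
TRANS = {"I": {"I": 0.90, "B": 0.10}, "B": {"I": 0.05, "B": 0.95}}
EMIT = {
    "I": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "B": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}

obs = "ATATCGCGCGATAT"
print(forward(obs, STATES, START, TRANS, EMIT))
print("".join(viterbi(obs, STATES, START, TRANS, EMIT)[0]))
```

The two functions share the same table-filling structure: the Forward algorithm sums over predecessor states, while Viterbi takes the maximum and remembers which predecessor achieved it.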