Ch08 GraphsDNAseq
Graph Algorithms
in Bioinformatics
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Outline
Introduction to Graph Theory
Eulerian & Hamiltonian Cycle Problems
Benzer Experiment and Interval Graphs
DNA Sequencing
The Shortest Superstring & Traveling
Salesman Problems
Sequencing by Hybridization
Fragment Assembly and Repeats in DNA
Fragment Assembly Algorithms
Bridges of Königsberg
Eulerian cycle problem: linear time
Hamiltonian cycle problem: NP-complete
He used trees
(acyclic connected
graphs) to enumerate
structural isomers
Benzer's Work
Developed deletion
mapping
Proved linearity of
the gene
Demonstrated
internal structure of
the gene
Seymour Benzer, 1950s
Benzer's Experiment
Idea: infect bacteria with pairs of mutant T4
bacteriophage (virus)
Each T4 mutant has an unknown interval
deleted from its genome
If the two intervals overlap: T4 pair is
missing part of its genome and is disabled,
so the bacteria survive
If the two intervals do not overlap: T4 pair
has its entire genome and is enabled,
so the bacteria die
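The experiment's outcomes define an interval graph: one vertex per mutant, and an edge whenever two deleted intervals overlap. A minimal sketch of that rule follows. The mutant names and interval coordinates are hypothetical, chosen only for illustration; in the real experiment the intervals are unknown and only the survive/die outcomes are observed.

```python
from itertools import combinations

def intervals_overlap(a, b):
    """Return True if closed intervals a = (start, end) and b share a position."""
    return a[0] <= b[1] and b[0] <= a[1]

def interval_graph(deletions):
    """Return the set of mutant pairs whose deleted intervals overlap."""
    edges = set()
    for (name_a, iv_a), (name_b, iv_b) in combinations(sorted(deletions.items()), 2):
        if intervals_overlap(iv_a, iv_b):
            edges.add((name_a, name_b))
    return edges

# Hypothetical deletions (not data from the experiment).
deletions = {"m1": (0, 4), "m2": (3, 8), "m3": (6, 10), "m4": (12, 15)}
edges = interval_graph(deletions)

for a, b in combinations(sorted(deletions), 2):
    # Overlapping deletions disable the phage pair, so the bacteria survive.
    outcome = "bacteria survive" if (a, b) in edges else "bacteria die"
    print(a, b, outcome)
```

Here only the pairs (m1, m2) and (m2, m3) overlap, so only those two infections leave the bacteria alive.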
1. Start at primer
(restriction site)
2. Grow DNA chain
3. Include ddNTPs
4. Stops reaction at all
possible points
5. Separate products
by length, using gel
electrophoresis
DNA Sequencing
Shear DNA into
millions of small
fragments
Read 500-700
nucleotides at a
time from the small
fragments (Sanger
method)
Fragment Assembly
Computational Challenge: assemble
individual short fragments (reads) into a
single genomic sequence (superstring)
Complexity: NP-complete
Note: this formulation does not take into
account sequencing errors
[Figure: two overlapping strings, overlap = 12]
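[Figure: the Shortest Superstring Problem (SSP) on the strings ATC, CCA, CAG, TCC, AGT recast as a Traveling Salesman Problem (TSP); edge weights are overlaps between strings (0, 1, or 2), and the resulting superstring is ATCCAGT]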
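For a set this small the shortest superstring can be found by trying every ordering, which mirrors the TSP view: each ordering is a tour, and larger overlaps mean a shorter string. The sketch below is a brute-force illustration only. The function names are hypothetical, the running time is factorial in the number of strings, and it assumes no string is a substring of another.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_superstring(strings):
    """Try every ordering and merge neighbours on their maximal overlap."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for nxt in order[1:]:
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best

reads = ["ATC", "CCA", "CAG", "TCC", "AGT"]
result = shortest_superstring(reads)
print(result)  # ATCCAGT
```

The ordering ATC, TCC, CCA, CAG, AGT has overlap 2 at every step, which gives the 7-letter superstring.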
l-mer composition
Spectrum(s, l): unordered multiset of all
(n - l + 1) l-mers in a string s of length n
The order of individual elements in Spectrum(s, l)
does not matter
For s = TATGGTGC all of the following are
equivalent representations of Spectrum ( s, 3 ):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
We usually choose the lexicographically maximal
representation as the canonical one.
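The definition can be checked directly. The sketch below represents the multiset with Python's `collections.Counter`, which is one convenient choice and not something the slides prescribe; the function name `spectrum` is likewise an assumption.

```python
from collections import Counter

def spectrum(s, l):
    """Multiset of all (n - l + 1) l-mers of s, where n = len(s)."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

spec = spectrum("TATGGTGC", 3)

# Every listing order describes the same multiset.
assert spec == Counter(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"])
assert spec == Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"])
print(sum(spec.values()))  # 6, i.e. n - l + 1 = 8 - 3 + 1
```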
ATGCAGGTCC
Path visited every VERTEX once
Path 1:
ATGCGTGGCA
Path 2:
ATGGCGTGCA
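Both paths come from the same set of 3-mers, which is why the reconstruction is ambiguous. A small sketch of the Hamiltonian path approach follows. It assumes vertices are the l-mers and that an edge joins two l-mers when the last l - 1 letters of one equal the first l - 1 letters of the other; the exhaustive search is for illustration only and does not scale, and the function name is hypothetical.

```python
def hamiltonian_reconstructions(lmers):
    """Return every string spelled by a Hamiltonian path in the l-mer overlap graph."""
    lmers = list(lmers)
    n = len(lmers)
    # Edge i -> j when the (l-1)-suffix of lmers[i] equals the (l-1)-prefix of lmers[j].
    succ = {
        i: [j for j in range(n) if j != i and lmers[i][1:] == lmers[j][:-1]]
        for i in range(n)
    }
    results = set()

    def extend(path, visited):
        if len(path) == n:
            # Spell the string: first l-mer, then the last letter of each later l-mer.
            results.add(lmers[path[0]] + "".join(lmers[i][-1] for i in path[1:]))
            return
        for j in succ[path[-1]]:
            if j not in visited:
                extend(path + [j], visited | {j})

    for start in range(n):
        extend([start], {start})
    return results

path1 = "ATGCGTGGCA"
lmers = [path1[i:i + 3] for i in range(len(path1) - 2)]
found = hamiltonian_reconstructions(lmers)
print(sorted(found))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```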
[Figure: graph whose vertices are the (l - 1)-mers AT, TG, GC, CA, GG, GT, CG and whose edges correspond to the l-mers; its two Eulerian paths spell ATGGCGTGCA and ATGCGTGGCA]
Euler's Theorem
A graph is balanced if for every vertex the
number of incoming edges equals the
number of outgoing edges:
in(v) = out(v)
Theorem: A connected graph is Eulerian if
and only if each of its vertices is balanced.
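The balance condition is cheap to test: count incoming and outgoing edges per vertex and compare. The sketch below assumes a directed graph given as a list of (from, to) pairs; it checks balance only, so connectivity still has to be verified separately before the theorem applies. The function name is hypothetical.

```python
from collections import defaultdict

def is_balanced(edges):
    """Return True if in(v) == out(v) for every vertex of the directed graph."""
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    vertices = set(indeg) | set(outdeg)
    return all(indeg[x] == outdeg[x] for x in vertices)

triangle = [("a", "b"), ("b", "c"), ("c", "a")]
chain = [("a", "b"), ("b", "c")]
print(is_balanced(triangle), is_balanced(chain))  # True False
```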
a.
Start with an arbitrary
vertex v and form an
arbitrary cycle with unused
edges until a dead end is
reached. Since the graph
is Eulerian this dead end is
necessarily the starting
point, i.e., vertex v.
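The step above starts the usual cycle-merging construction: walk along unused edges until stuck, then splice in further cycles from vertices that still have unused edges. The sketch below is a stack-based variant of that idea (commonly attributed to Hierholzer), not necessarily the exact procedure on the slide. It assumes the directed graph is connected and balanced, and the function name is hypothetical.

```python
from collections import defaultdict

def eulerian_cycle(edges, start):
    """Return an Eulerian cycle as a list of vertices beginning and ending at start."""
    unused = defaultdict(list)
    for u, v in edges:
        unused[u].append(v)

    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if unused[v]:
            # Keep walking along an unused edge.
            stack.append(unused[v].pop())
        else:
            # Dead end: this vertex is finished, so add it to the cycle.
            cycle.append(stack.pop())
    return cycle[::-1]

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "a")]
cycle = eulerian_cycle(edges, "a")
print(cycle)
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.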
[Figure: DNA is sheared ("Shake") into DNA fragments; each fragment is inserted into a vector, a circular genome (bacterium, plasmid), at a known location (restriction site)]
Cosmid 40,000
Electrophoresis Diagrams
Reading an Electropherogram
Filtering
Smoothing
Shotgun Sequencing
genomic segment
Fragment Assembly
reads
Read Coverage
Lander-Waterman model:
Assuming a uniform distribution of reads, coverage C = 10 results in about 1 gapped
region per 1,000,000 nucleotides
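The figure can be reproduced with a simplified form of the Lander-Waterman estimate: with N reads and coverage C, the expected number of gaps is roughly N * e^(-C). The sketch below ignores the minimum overlap needed to join two reads and assumes a read length of 500 nucleotides (the low end of the 500-700 range); both are simplifying assumptions, and the function name is hypothetical.

```python
import math

def expected_gaps(genome_length, read_length, coverage):
    """Approximate expected number of gaps: N * exp(-C), with N = C * G / L reads."""
    n_reads = coverage * genome_length / read_length
    return n_reads * math.exp(-coverage)

gaps = expected_gaps(genome_length=1_000_000, read_length=500, coverage=10)
print(round(gaps, 2))  # 0.91, i.e. about one gap per 1,000,000 nucleotides
```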
Repeat Types
Low-Complexity DNA (e.g. ATATATATACATA)
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap
Find the best match between the suffix of one
read and the prefix of another
Overlapping Reads
Sort all k-mers in reads (k ~ 24)
TACA TAGATTACACAGATTAC T GA
|| ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA
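Sorting the k-mers places identical k-mers next to each other, so reads that share a k-mer become candidate overlaps without comparing every pair of reads. A rough sketch under that reading follows. The first two reads are the pair shown above with spaces removed, the third read is invented as a negative example, and k = 8 is used instead of 24 only because these reads are short.

```python
from itertools import combinations, groupby

def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one exact k-mer."""
    kmers = sorted(
        (read[i:i + k], idx)
        for idx, read in enumerate(reads)
        for i in range(len(read) - k + 1)
    )
    pairs = set()
    # After sorting, equal k-mers are adjacent and can be grouped in one pass.
    for _, group in groupby(kmers, key=lambda item: item[0]):
        ids = sorted({idx for _, idx in group})
        pairs.update(combinations(ids, 2))
    return pairs

reads = [
    "TACATAGATTACACAGATTACTGA",
    "TAGTTAGATTACACAGATTACTAGA",
    "GGGGCCCCGGGGCCCC",  # hypothetical unrelated read
]
pairs = candidate_pairs(reads, k=8)
print(pairs)  # {(0, 1)}
```

Candidate pairs would still need a proper alignment step to confirm the overlap.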
Solution:
Discard all k-mers that appear more than
t × Coverage times (t ~ 10)
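The filter amounts to counting k-mers and dropping those above a threshold. A minimal sketch follows; the toy reads, k = 2, and the small values of t and coverage are hypothetical and chosen only so the effect is visible on a tiny input.

```python
from collections import Counter

def overrepresented_kmers(reads, k, coverage, t=10):
    """Return the k-mers that occur more than t * coverage times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count > t * coverage}

toy_reads = ["ATATATAT", "ATATCG"]  # hypothetical reads with a low-complexity repeat
discarded = overrepresented_kmers(toy_reads, k=2, coverage=1, t=2)
print(sorted(discarded))  # ['AT', 'TA']
```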
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Layout
Repeats are a major challenge
Do two aligned fragments really overlap, or
are they from two copies of a repeat?
Solution: repeat masking, i.e., hide the
repeats!
Masking results in a high rate of misassembly
(up to 20%)
Misassembly means a lot more work at the
finishing step
Too dense:
Overcollapsed?
Inconsistent links:
Overcollapsed?
Consensus
A consensus sequence is derived from a
profile of the assembled fragments
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
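One simple way to derive a consensus from a profile is a column-wise majority vote over the aligned reads. The sketch below assumes the reads are already aligned to equal length with "-" marking gaps; real assemblers typically weight the vote by base quality, which is omitted here. The example reads are adapted from the alignment shown earlier in the deck.

```python
from collections import Counter

def consensus(aligned_reads, gap="-"):
    """Return the most frequent non-gap base in each column of the alignment."""
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)

aligned = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",  # one missing base and one substitution
    "TAGATTACACAGATTACTGA",
]
cons = consensus(aligned)
print(cons)  # TAGATTACACAGATTACTGA
```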
Find a path visiting every VERTEX exactly once: Hamiltonian path problem
Multiple Repeats
Repeat1 Repeat2 Repeat1 Repeat2
Can be easily
constructed with any
number of repeats
Minimizing Errors
If an error exists in one of the 20-mer reads,
the error will be perpetuated among all of the
smaller pieces broken from that read.
Conclusions
Graph theory is a vital tool for solving
biological problems