Sequence Alignment Algorithms Review
1093/bib/bbq015
Advance Access published on 11 May 2010
Abstract
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis
of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety
of alignment algorithms and software have been subsequently developed over the past two years. In this article,
we will systematically review the current development of these algorithms and introduce their practical applications
Corresponding author. Heng Li, Broad Institute, 5 Cambridge Center, Cambridge, MA 02142, USA. Tel: 617 714 7000;
Fax: 617 714 8102; E-mail: hengli@[Link]
Heng Li is a postdoctoral researcher at the Broad Institute and used to work at the Sanger Institute where he developed several popular
alignment algorithms.
Nils Homer is a PhD student in Computer Science and Human Genetics departments of UCLA. He developed the BFAST alignment
algorithm.
© The Author 2010. Published by Oxford University Press. For Permissions, please email: [Link]@[Link]
474 Li and Homer
progress on general alignment techniques and then examine their applications in the context of specific sequencing platforms and experimental designs. We will use simulated data to evaluate the necessity of gapped alignment and paired-end mapping, and present a list of alignment software that is actively maintained and widely used. Finally, we will discuss the future development of alignment algorithms.

OVERVIEW OF ALIGNMENT ALGORITHMS
Most fast alignment algorithms construct auxiliary data structures, called indices, for the read sequences or the reference sequence, or sometimes both. Depending on the property of the index, alignment algorithms can be largely grouped into three cate-

A seed allowing internal mismatches is called a spaced seed; the number of matches in the seed is its weight. Eland (A.J. Cox, unpublished results) was the first program that utilized spaced seeds in short-read alignment. It uses six seed templates spanning the entire short read such that a two-mismatch hit is guaranteed to be identified by at least one of the templates, no matter where the two mismatches occur. SOAP [17] adopts almost the same strategy except that it indexes the genome rather than reads. SeqMap [18] and MAQ [19] extend the method to allow k mismatches, but to be fully sensitive to k-mismatch hits, they require $\binom{2k}{k}$ templates, which is exponential in k and thus inefficient for large k. To improve the speed, MAQ only guarantees to find two-mismatch hits in the first 28 bp of each read, the most reliable part of an Illumina read. It extends
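The template counts quoted above (six templates for two mismatches, $\binom{2k}{k}$ in general) are consistent with a simple pigeonhole construction: cut the read into 2k blocks and let each template require exact matches in k of them. The sketch below illustrates that construction only; it is an assumption made for illustration, not necessarily the exact template layout used by Eland, SOAP, SeqMap or MAQ, and the function names `seed_templates` and `is_found` are hypothetical.

```python
from itertools import combinations
from math import comb


def seed_templates(read_len, k):
    """Return C(2k, k) masks over the read; 1 = must match, 0 = don't care.

    Assumed construction: split the read into 2k blocks and let every
    template require exact matches in a different choice of k blocks.
    """
    n_blocks = 2 * k
    bounds = [i * read_len // n_blocks for i in range(n_blocks + 1)]
    templates = []
    for chosen in combinations(range(n_blocks), k):
        mask = [0] * read_len
        for block in chosen:
            for pos in range(bounds[block], bounds[block + 1]):
                mask[pos] = 1
        templates.append(mask)
    return templates


def is_found(templates, mismatch_positions):
    """A hit is seeded if some template has no 'must match' position at a mismatch."""
    return any(all(mask[p] == 0 for p in mismatch_positions) for mask in templates)


templates = seed_templates(32, 2)
all_pairs_found = all(
    is_found(templates, pair) for pair in combinations(range(32), 2)
)
print(len(templates), comb(4, 2), all_pairs_found)
```

With k mismatches at most k of the 2k blocks are affected, so at least k blocks are mismatch-free and one template covers exactly those blocks. The same argument shows why the guarantee is lost beyond k mismatches, and why the number of templates grows quickly with k.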
Figure 1: Data structures based on a prefix trie. (A) Prefix trie of string AGGAGC where symbol ^ marks the
for example, would also work with suffix tree index in principle.

Trie, prefix/suffix tree and FM-index
A suffix trie, or simply a trie, is a data structure that stores all the suffixes of a string, enabling fast string matching. To establish the link between a trie and an FM-index, a data structure based on the Burrows-Wheeler Transform (BWT) [40], we focus on the prefix trie, which is the trie of the reverse string. All algorithms on a trie can be seamlessly applied to the corresponding prefix trie.
Figure 1A gives the prefix trie of AGGAGC. Finding all exact matches of a query sequence is equivalent to searching for a path descending from the root where each edge label on the path matches a query letter in the reverse order. If such a path exists, the query is a substring. Given a query AGC, for example, the path matching the query is [0, 6] → [3, 3] → [5, 5] → [1, 1].
The time complexity of determining if a query has an exact match against a trie is linear in the length of the query, independent of the length of the reference sequence. However, a trie takes $O(L^2)$ space where L is the length of the reference. It is impractical to build a trie even for a bacterial genome. Several data structures have been proposed to reduce the space. Among these data structures, a suffix tree (Figure 1C) is most widely used. It achieves linear space while allowing linear-time searching. Although it is possible in theory to represent a suffix tree in $L \log_2 L + O(L)$ bits using rank-selection operations [41], even the most space-efficient implementation in bioinformatics tools requires 12–17 bytes per nucleotide [42], making it impractical to hold the suffix tree of the human genome in memory.
To solve this problem, Abouelhoda et al. [38] derived an enhanced suffix array that consists of a suffix array and several auxiliary arrays, taking 6.25 bytes per nucleotide. It can be regarded as an implicit representation of a suffix tree, and has the same time complexity as a suffix tree in finding exact matches, better than the suffix array originally invented by Manber and Myers [43].
A further improvement on memory is achieved by Ferragina and Manzini [39] who proposed the
Sequence alignment algorithms for next-generation sequencing 477
FM-index and found that locating a child of a parent node in the prefix trie can be done in constant time using a backward search on this data structure. Thus the time complexity of finding exact matches with an FM-index is identical to that with a trie. With respect to memory, the FM-index was originally designed as a compressed data structure such that the theoretical index size can be smaller than the original string if the string contains repeats (equivalently, has small entropy). The FM-index is usually not compressed for better performance during alignment since DNA sequences have a small alphabet. The practical memory footprint of an FM-index is typically 0.5–2 bytes per nucleotide, depending on implementations and the parameters in use. The index of the entire human genome only takes 2–8 GB of memory.

by dynamic programming. BWA-SW furthers BWT-SW by representing the query as a directed acyclic word graph (DAWG) [52], which also enables it to deploy heuristics to accelerate alignment.
Bowtie and BWA also sample short substrings of the reference, but instead of performing dynamic programming, they compare the query and sampled substrings only allowing a few differences. In addition, as they require the entire read to be aligned, the traversal of the trie can be bounded since it is unnecessary to descend deeper in the trie if it can be predicted that doing so leads to an alignment with excessive mismatches and gaps. Alternatively, Bowtie and BWA can be considered to enumerate all combinations of possible mismatches and gaps in the query sequence such that the altered query can be aligned exactly.
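The backward search on an FM-index described earlier in this section can be illustrated on the same string, AGGAGC, and the same query, AGC. The sketch below is a minimal, uncompressed toy: it builds the suffix array by naive sorting and stores full occurrence tables, which no practical aligner would do for a genome. The names `build_fm_index` and `backward_search` are hypothetical and do not correspond to any particular tool.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, BWT, C table, Occ table) for text + '$'."""
    text += "$"  # sentinel, lexicographically smaller than A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))
    c_table, total = {}, 0
    for ch in alphabet:  # c_table[ch] = number of symbols smaller than ch
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, bwt, c_table, occ


def backward_search(query, c_table, occ, n):
    """Return (visited intervals, final interval or None), intervals inclusive."""
    lo, hi = 0, n - 1
    path = [(lo, hi)]
    for ch in reversed(query):  # query letters are consumed in reverse order
        if ch not in c_table:
            return path, None
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi + 1] - 1
        if lo > hi:
            return path, None
        path.append((lo, hi))
    return path, (lo, hi)


sa, bwt, c_table, occ = build_fm_index("AGGAGC")
path, interval = backward_search("AGC", c_table, occ, len(bwt))
print(path)  # [(0, 6), (3, 3), (5, 5), (1, 1)]
print([sa[i] for i in range(interval[0], interval[1] + 1)])  # [3]
```

Each step costs two table lookups regardless of the reference length, which is the constant-time child lookup referred to in the text. The interval sequence reproduces the path [0, 6] → [3, 3] → [5, 5] → [1, 1] given for the query AGC.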
[Figure panel A (plot): y-axis "# wrongly mapped reads", x-axis "# mapped reads (x10^6)"; series: gap-pe, gap-se, ungap-se, bwasw-se, novo-pe.]

some algorithms. For example, on the simulated data used in Figure 2B, when the aligner in use does ungapped alignment only, an indel polymorphism is seen causing seven reads to be mapped to a wrong position with high confidence. BreakDancer [55] predicts a high-scoring translocation based on the wrong alignments. Effective gapped aligners such as BWA and novoalign ([Link]) do not produce this false translocation. Therefore, gapped alignment is essential to variant discovery, but how ChIP-seq and RNA-seq [2] may be affected is an open question.
[Figure panel B (plot): y-axis "# false SNPs"; series: gap-pe, gap-se, ungap-se, ungap-se-GATK, bwasw-se, novo-pe.]

Role of paired-end and mate-pair mapping
Some sequencing technologies produce read pairs
[Figure 3 (plot): y-axis "# wrongly mapped reads", x-axis "# mapped reads (x10^3)"; series: novo-noQual, novo-qual, maq-noQual, maq-qual.]
Figure 3: Alignment accuracy of simulated reads with and without base quality. Paired-end reads (51 bp) are simulated by MAQ from the human genome, assuming 0.085% substitution and 0.015% indel mutation rate. Base quality model is trained from run ERR000589

Figure 4: Color-space encoding. (A) Color space encoding matrix. (B) Conversion between base and color sequence. (C) The color encoding of the reverse complement of the base sequence is the reverse of the color sequence. (D) A sequencing error leads to con-
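The color-space properties listed in the Figure 4 caption can be checked with a few lines of code. The sketch below assumes the standard SOLiD two-base encoding, in which the color of a dinucleotide equals the XOR of the 2-bit codes A=0, C=1, G=2, T=3; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(bases):
    """Color of each adjacent base pair (XOR of the 2-bit base codes)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(bases, bases[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    out = [first_base]
    for color in colors:
        out.append(BASE[CODE[out[-1]] ^ color])
    return "".join(out)


def revcomp(bases):
    """Reverse complement of a base sequence."""
    return bases[::-1].translate(str.maketrans("ACGT", "TGCA"))


seq = "ATCGGA"
colors = encode(seq)
print(colors)                                   # [3, 2, 3, 0, 2]
print(decode("A", colors) == seq)               # decoding recovers the bases
print(encode(revcomp(seq)) == colors[::-1])     # reverse complement = reversed colors

bad = colors.copy()
bad[1] ^= 1                                     # one color sequencing error
print(decode("A", bad))                         # every downstream base is wrong

snp = encode("ATTGGA")                          # one base substitution (C -> T)
print([i for i, (x, y) in enumerate(zip(colors, snp)) if x != y])  # two adjacent colors
```

The last two checks show why naive decoding is fragile (one color error corrupts every base after it) and why a true single-base change shows up as two adjacent color changes.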
alignment gaps and to allow partially aligned read sequences in alignment. At present, all programs capable of genome-wide long-read alignment follow the seed-and-extend paradigm, seeding the alignment using a hash table index [8, 9] or more recently an FM-index [50, 51], and extending seed matches with the banded Smith–Waterman algorithm. This allows for sensitive detection of indels as well as allowing for partial hits.

Aligning SOLiD reads
The SOLiD sequencing technology observes two adjacent bases simultaneously. Each dinucleotide (16 possibilities) is encoded as one of four possible colors, with the encoding referred to as color space (Figure 4A). Although the known primer base allows for the decoding of a color read to bases (Figure 4B), contiguous errors will arise from a single color sequencing error in this conversion (Figure 4D). Thus algorithms that naively decode a color read will fail. Given the fact that reverse complementing a base sequence is equivalent to reversing the color sequence (Figure 4C), the proper solution is to encode the reference as a color sequence and align color reads directly to the color reference as if they are base sequences with the exception of the complementing rule. After alignment, color sequences can be converted to base sequences with dynamic programming [48].
Performing alignment entirely in the color space may not be optimal, though. With color encoding, one base mutation leads to two contiguous color changes with some restrictions (Figure 4E). Two adjacent consistent color changes are preferred over two discontinuous changes. A better solution is to perform a color-aware Smith–Waterman alignment, as found in BFAST and SHRiMP [29, 56]. This extension to the standard Smith–Waterman algorithm allows the detection of indels without the aid of post-alignment analysis at the cost of increased computational complexity. Most alignment algorithms described in the previous sections can be applied to SOLiD sequencing reads with few modifications to make them color-space aware.

Aligning bisulfite-treated reads
Bisulfite sequencing is a technology to identify methylation patterns [3]. From the alignment point of view, unmethylated ‘C’ bases, or cytosines, are converted to ‘T’ (sequences 1 and 4 in Figure 5) and the ‘G’ bases complementary to those cytosines are converted to ‘A’ (sequences 2 and 3). Directly aligning converted sequences against the standard reference sequence would be difficult due to the excessive mismatches. Most aligners capable of bisulfite alignments [24, 57] do the following. They create two reference sequences, one with all ‘C’ bases converted to ‘T’ bases (the C-to-T reference) and the other with all ‘G’ bases converted to ‘A’ bases (the G-to-A reference). In alignment, ‘C’ bases in the reads are converted to ‘T’ bases and the reads are mapped to the
C-to-T reference (then a C–T mismatch is effectively regarded as a match); a similar procedure is performed for the G-to-A conversion in the next round of alignment. The results from two rounds of align-

Realignment
Reads mapped to the same locus are highly correlated, but all read aligners map a read independently of others and thus cannot make use of the correlation between reads or the expected coverage at the same position. Especially in the presence of indels, not using this correlation may lead to a wrong alignment around the tail of a read. For indel calling, it is necessary to perform multi-alignment for reads mapped to the same locus. ReAligner [61] is such a tool, but was originally designed for capillary read alignment. GATK implements a different algorithm for new sequencing data. Sophisticated indel callers such as SAMtools [62] also implicitly realign reads around

Figure 5: Bisulfite sequencing. Cytosines with underlines are not methylated. Denaturation and bisulfite treatment will convert these cytosines to uracils. After amplification, four different sequences from the original double-strand DNA result.
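The two-reference strategy for bisulfite-treated reads described above can be sketched in a few lines. This is a toy illustration under strong assumptions: exact substring search stands in for a real aligner, only the forward orientation of the read is tried, and the function name `bisulfite_hits` and the example sequences are hypothetical.

```python
def c_to_t(seq):
    """In-silico conversion used for the C-to-T round."""
    return seq.replace("C", "T")


def g_to_a(seq):
    """In-silico conversion used for the G-to-A round."""
    return seq.replace("G", "A")


def bisulfite_hits(read, reference):
    """Run two rounds of exact matching, one per converted reference.

    Converting both the read and the reference makes a C-T (or G-A)
    difference count as a match. Exact matching is an assumption that
    replaces the mismatch-tolerant search of a real aligner.
    """
    hits = []
    for name, convert in (("C-to-T", c_to_t), ("G-to-A", g_to_a)):
        conv_ref, conv_read = convert(reference), convert(read)
        start = conv_ref.find(conv_read)
        while start != -1:
            hits.append((name, start))
            start = conv_ref.find(conv_read, start + 1)
    return hits


reference = "AACGTCCGATTG"
# Read from reference[2:8] ("CGTCCG"): the first C is methylated and kept,
# the two unmethylated Cs were converted to T.
read = "CGTTTG"
print(reference.find(read))             # -1: no direct exact match
print(bisulfite_hits(read, reference))  # [('C-to-T', 2)]
```

After a hit is found, comparing the original (unconverted) read with the original reference at the hit position shows which cytosines stayed ‘C’ and are therefore candidates for methylation.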
speed is of the order of 0.5 Gbp per CPU day, much slower than short-read aligners. Recently developed algorithms such as Mosaik and BWA-SW are faster and may alleviate this computational bottleneck.

CONCLUSIONS AND FUTURE DEVELOPMENT
Short-read alignment is thought to be the computing bottleneck of the analysis of new sequencing data. Fortunately, the active development of alignment algorithms has relieved this bottleneck even with the rapidly increasing throughput of sequencing machines. In a couple of years, however, long reads will dominate again and programs developed for short reads will not be applicable; long-read alignment and

Key Points
The advent of new sequencing technologies paves the way for various biological studies, most of which involve sequence alignment on an unparalleled scale.
The development of alignment algorithms has been successful and short-read alignment against a single reference is not the bottleneck in data analyses any more.
With the increasing read lengths produced by the new sequencing technologies, we expect further development in multi-reference alignment, long-read alignment and de novo assembly.

Acknowledgements
We thank the three anonymous reviewers whose comments helped us to improve the manuscript.

FUNDING
H.L. is supported by the NIH grant 1U01HG005208-
15. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol 1982;162:705–8.
16. Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics 2002;18:440–5.
17. Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24:713–4.
18. Jiang H, Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 2008;24:2395–6.
19. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:1851–8.
20. Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 2008;9:128.
21. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25:2841–2.
22. Baeza-Yates RA, Perleberg CH. Fast and practical approximate string matching. In: Apostolico A, Crochemore M,
35. Eppstein D, Galil Z, Giancarlo R, Italiano GF. Sparse dynamic programming. In: SODA. Philadelphia: Society for Industrial and Applied Mathematics, 1990;513–22.
36. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005;6:31.
37. Myers EW. An O(ND) Difference algorithm and its variations. Algorithmica 1986;1(2):251–66.
38. Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithms 2004;2:53–86.
39. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000), Redondo Beach, CA, USA. 2000;390–8.
40. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. CA: Palo Alto, 1994.
41. Munro JI, Raman V, Rao SS. Space efficient suffix trees. J Algorithms 2001;39(2):205–22.
55. Chen K, Wallis JW, McLellan MD, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009;6:677–81.
56. Homer N, Merriman B, Nelson SF. Local alignment of two-base encoded DNA sequence. BMC Bioinformatics 2009;10:175.
57. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
58. Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5:621–8.
59. De Bona F, Ossowski S, Schneeberger K, et al. Optimal spliced alignments of short sequence reads. Bioinformatics 2008;24:i174–80.
60. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009;25:1105–11.
61. Anson EL, Myers EW. ReAligner: a program for refining DNA sequence multi-alignments. J Comput Biol 1997;4(3):369–83.
64. Manske HM, Kwiatkowski DP. LookSeq: a browser-based viewer for deep sequencing data. Genome Res 2009;19(11):2125–32.
65. Milne I, Bayer M, Cardle L, et al. Tablet–next generation sequence assembly visualization. Bioinformatics 2010;26(3):401–2.
66. Carver T, Bohme U, Otto T, et al. BamView: viewing mapped read alignment data in the context of the reference sequence. Bioinformatics 2010;26(5):676–7.
67. Koboldt DC, Chen K, Wylie T, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 2009;25:2283–5.
68. Langmead B, Schatz MC, Lin J, et al. Searching for SNPs with cloud computing. Genome Biol 2009;10(11):R134.
69. Li R, Li Y, Zheng H, et al. Building the sequence map of the human pan-genome. Nat Biotechnol 2010;28:57–63.
70. Schneeberger K, Hagmann J, Ossowski S, et al. Simultaneous alignment of short reads against multiple genomes. Genome Biol 2009;10:R98.
71. Mäkinen V, Navarro G, Sirén J, et al. Storage and retrieval