Sequence Alignment Algorithms Review

This document reviews the advancements in sequence alignment algorithms developed for next-generation sequencing technologies, highlighting their importance in processing large-scale genomic data. It categorizes alignment algorithms into those based on hash tables and suffix trees, discussing their efficiency and practical applications. The authors conclude that short-read alignment is no longer a major bottleneck, and they explore future developments in alignment algorithms, particularly in relation to long sequence reads and cloud computing.

BRIEFINGS IN BIOINFORMATICS. VOL 11. NO 5. 473–483 doi:10.1093/bib/bbq015
Advance Access published on 11 May 2010

A survey of sequence alignment algorithms for next-generation sequencing
Heng Li and Nils Homer
Submitted: 3rd March 2010; Received (in revised form): 14th April 2010

Abstract
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.

Downloaded from [Link] by guest on October 19, 2015
Keywords: new sequencing technologies; alignment algorithm; sequence analysis

INTRODUCTION
The rapid development of new sequencing technologies substantially extends the scale and resolution of many biological applications, including the scan of genome-wide variation [1], identification of protein binding sites (ChIP-seq), quantitative analysis of transcriptome (RNA-seq) [2], the study of the genome-wide methylation pattern [3] and the assembly of new genomes or transcriptomes [4]. Most of these applications take alignment or de novo assembly as the first step; even in de novo assembly, sequence reads may still need to be aligned back to the assembly as most large-scale short-read assemblers [5, 6] do not track the location of each individual read. Sequence alignment is therefore essential to nearly all the applications of new sequencing technologies.

All new sequencing technologies in production, including Roche/454, Illumina, SOLiD and Helicos, are able to produce data of the order of giga base-pairs (Gbp) per machine day [7]. With the emergence of such data, researchers have quickly realized that even the best tools for aligning capillary reads [8, 9] are not efficient enough given the unprecedented amount of data. To keep pace with the throughput of sequencing technologies, many new alignment tools have been developed in the last two years. These tools exploit the many advantages specific to each new sequencing technology, such as the short sequence length of Illumina, SOLiD and Helicos reads, the di-base encoding of SOLiD reads, the high base quality towards the 5′-end of Illumina and 454 reads, the low indel error rate of Illumina reads and the low substitution error rate of Helicos reads. Short read aligners outperform traditional aligners in terms of both speed and accuracy. They greatly boost the applications of new sequencing technologies as well as the theoretical studies of alignment algorithms.

This article aims to systematically review the recent advances with respect to alignment algorithms. It is organized as follows. We first review the

Corresponding author. Heng Li, Broad Institute, 5 Cambridge Center, Cambridge, MA 02142, USA. Tel: 617 714 7000;
Fax: 617 714 8102; E-mail: hengli@[Link]
Heng Li is a postdoctoral researcher at the Broad Institute and used to work at the Sanger Institute where he developed several popular
alignment algorithms.
Nils Homer is a PhD student in Computer Science and Human Genetics departments of UCLA. He developed the BFAST alignment
algorithm.

© The Author 2010. Published by Oxford University Press. For Permissions, please email: [Link]@[Link]
474 Li and Homer

progress on general alignment techniques and then examine their applications in the context of specific sequencing platforms and experimental designs. We will use simulated data to evaluate the necessity of gapped alignment and paired-end mapping, and present a list of alignment software that are actively maintained and widely used. Finally, we will discuss the future development of alignment algorithms.

OVERVIEW OF ALIGNMENT ALGORITHMS
Most fast alignment algorithms construct auxiliary data structures, called indices, for the read sequences or the reference sequence, or sometimes both. Depending on the property of the index, alignment algorithms can be largely grouped into three categories: algorithms based on hash tables, algorithms based on suffix trees and algorithms based on merge sorting. The third category only consists of Slider [10] and its descendant SliderII [11]. This review will therefore focus on the first two categories.

Algorithms based on hash tables
The idea of hash table indexing can be traced back to BLAST [12, 13]. All hash table based algorithms essentially follow the same seed-and-extend paradigm. BLAST keeps the position of each k-mer (k = 11 by default) subsequence of the query in a hash table with the k-mer sequence being the key, and scans the database sequences for k-mer exact matches, called seeds, by looking up the hash table. BLAST extends and joins the seeds first without gaps and then refines them by a Smith–Waterman alignment [14, 15]. It outputs statistically significant local alignments as the final results.

The basic BLAST algorithm has been improved and adapted to alignments of different types. Nevertheless, the techniques discussed below focus on mapping a set of short query sequences against a long reference genome of the same species.

Improvement on seeding: spaced seed
BLAST seeds alignment with 11 consecutive matches by default. Ma et al. [16] discovered that seeding with non-consecutive matches improves sensitivity. For example, a template ‘111010010100110111’ requiring 11 matches at the ‘1’ positions is 55% more sensitive than BLAST’s default template ‘11111111111’ for two sequences of 70% similarity. A seed allowing internal mismatches is called spaced seed; the number of matches in the seed is its weight.

Eland (A.J. Cox, unpublished results) was the first program that utilized spaced seed in short-read alignment. It uses six seed templates spanning the entire short read such that a two-mismatch hit is guaranteed to be identified by at least one of the templates, no matter where the two mismatches occur. SOAP [17] adopts almost the same strategy except that it indexes the genome rather than reads. SeqMap [18] and MAQ [19] extend the method to allow k-mismatches, but to be fully sensitive to k-mismatch hits, they require $\binom{2k}{k}$ templates, which is exponential in k and thus inefficient given large k. To improve the speed, MAQ only guarantees to find two-mismatch hits in the first 28 bp of each read, the most reliable part of an Illumina read. It extends the partial match when a seed match is found.

RMAP [20, 21], which is based on the Baeza–Yates–Perleberg algorithm [22], applies a different set of seed templates. It effectively uses k + 1 templates to find k-mismatch hits. RMAP reduces the number of templates, but for large k, the weight of each template is small. Such a strategy cannot fully take advantage of hash table indexing in this case as many candidates will be returned.

An improvement is achieved by Lin et al. [23] who present the optimal way to design a minimum number of spaced seeds, given a specified read length, sensitivity requirement and memory usage. For example, their program ZOOM is able to identify all two-mismatch hits for 32 bp reads using five seed templates of weight 14. In comparison, RMAP uses three templates of weight 10; Eland uses six templates of weight 16 but with only 12.5 bases indexed in the hash table to reduce memory requirement. As the time complexity of spaced seed algorithm is approximately proportional to $mnL/4^q$ where q is the weight, m the number of templates, n the number of reads and L the genome size, ZOOM has better theoretical time complexity given limited memory.

The memory required by hashing genome is usually bytes where s is the sampling frequency [24]. It is memory demanding to hold in RAM a hash table with q larger than 15. Homer et al. [25] proposed a two-level indexing scheme for any large q. They build a hash table for j-long (j < q, typically 14) bases. To find a q-long key, they look up the hash table from the first j bases and then perform a binary search among elements stored in the resulting bucket. Looking up a q-long key takes time, only
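As a rough illustration of the template counts quoted in the spaced-seed discussion, the following Python sketch builds templates by a simple block construction: the read is cut into 2k blocks and each template keeps k of them, which gives C(2k, k) templates (six templates of weight 16 for a 32 bp read and k = 2). The block layout, the function names and the example read are assumptions made for illustration only; the actual templates used by Eland, SOAP, SeqMap or MAQ may differ.

```python
from itertools import combinations
from math import comb

def block_templates(read_len, k):
    """Cut the read into 2k blocks; each template keeps ('1') k of them.

    Hypothetical helper: k mismatches can touch at most k blocks, so at
    least k blocks stay clean and some template covers only clean blocks.
    """
    n_blocks = 2 * k
    bounds = [i * read_len // n_blocks for i in range(n_blocks + 1)]
    templates = []
    for kept in combinations(range(n_blocks), k):
        mask = ["0"] * read_len
        for b in kept:
            for i in range(bounds[b], bounds[b + 1]):
                mask[i] = "1"
        templates.append("".join(mask))
    return templates

def is_fully_sensitive(templates, read_len, k):
    """Exhaustively check that, for every placement of k mismatches, at
    least one template has all of them on its '0' (ignored) positions."""
    for mismatches in combinations(range(read_len), k):
        if not any(all(t[i] == "0" for i in mismatches) for t in templates):
            return False
    return True

def seed_key(seq, template):
    """Hash key of a spaced seed: the bases under the '1' positions."""
    return "".join(base for base, bit in zip(seq, template) if bit == "1")

templates = block_templates(32, 2)
print(len(templates), comb(4, 2))            # both counts are 6
print(is_fully_sensitive(templates, 32, 2))

# The weight-11 template of Ma et al. quoted in the text: a substitution
# under a '0' position leaves the hash key unchanged.
ma_template = "111010010100110111"
read = "ACGTACGTACGTACGTAC"                   # made-up 18 bp sequence
mutated = read[:3] + "A" + read[4:]          # position 3 sits under a '0'
print(seed_key(read, ma_template) == seed_key(mutated, ma_template))
```

The exhaustive check is only practical for short reads and small k; it is meant to make the pigeonhole argument concrete, not to suggest how a production aligner verifies its templates.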
Sequence alignment algorithms for next-generation sequencing 475

Table 1: Popular short-read alignment software

Program       Algorithm      SOLiD   Long(a)  Gapped  PE(b)  Q(c)
Bfast         hashing ref.   Yes     No       Yes     Yes    No
Bowtie        FM-index       Yes     No       No      Yes    Yes
BWA           FM-index       Yes(d)  Yes(e)   Yes     Yes    No
MAQ           hashing reads  Yes     No       Yes(f)  Yes    Yes
Mosaik        hashing ref.   Yes     Yes      Yes     Yes    No
Novoalign(g)  hashing ref.   No      No       Yes     Yes    Yes

(a) Work well for Sanger and 454 reads, allowing gaps and clipping.
(b) Paired end mapping.
(c) Make use of base quality in alignment.
(d) BWA trims the primer base and the first color for a color read.
(e) Long-read alignment implemented in the BWA-SW module.
(f) MAQ only does gapped alignment for Illumina paired-end reads.
(g) Free executable for non-profit projects only.

slightly worse than the optimal speed O(1). The peak memory becomes independent of q. A similar idea is also used by Eland and MAQ, but they index reads instead of the genome.

Many other aligners [26–28] also use spaced seed with different templates designed specifically for the reference genome and sensitivity tolerances, making spaced seed the most popular approach for short-read alignment.

Improvement on seeding: q-gram filter and multiple seed hits
A potential problem with consecutive seed and spaced seed is that they disallow gaps within the seed. Gaps are usually found afterwards in the extension step by dynamic programming, or by attempting small gaps at each read position [17, 18]. The q-gram filter, as is implemented in SHRiMP [29] and RazerS [30], provides a possible solution to building an index natively allowing gaps. The q-gram filter is based on the observation that at the occurrence of a w-long query string with at most k differences (mismatches and gaps), the query and the w-long database substring share at least $(w + 1) - (k + 1)q$ common substrings of length q [31–33].

Methods based on spaced seeds and the q-gram filter are similar in that they both rely on fast lookup in a hash table. They are mainly different in that the former category initiates seed extension from one long seed match, while the latter initiates extension usually with multiple relatively short seed matches. In fact, the idea of requiring multiple seed matches is more frequently seen in capillary read aligners such as SSAHA2 and BLAT; it is a major technique to accelerate long-read alignment.

Improvements on seed extension
Due to the use of long spaced seeds, many aligners do not need to perform seed extension or only extend a seed match without gaps, which is much faster than applying a full dynamic programming. Nonetheless, several improvements over BLAST have been made regarding seed extension. A major improvement comes from the recent advance in accelerating the standard Smith–Waterman with vectorization. The basic idea is to parallelize alignment with the CPU SIMD instructions such that multiple parts of a query sequence can be processed in one CPU cycle. Using the SSE2 CPU instructions implemented in most recent x86 CPUs, [34] derived a revised Smith–Waterman algorithm that is over 10 times faster than the standard algorithm. Novoalign ([Link]), CLC Genomics Workbench ([Link]id=1240) and SHRiMP are known to make use of vectorization.

Another improvement is achieved by constraining dynamic programming around seeds already found in the seeding step [25, 35, 36]. Thus unnecessary visits to cells far away from seed hits in iteration are greatly reduced. In addition, Myers [37] found that a query can be aligned in full length to an L-long target sequence with up to k mismatches and gaps in O(kL) time, independent of the length of the query. These techniques also help to accelerate the alignment when dynamic programming is the bottleneck.

Algorithms based on suffix/prefix tries
All algorithms in this category essentially reduce the inexact matching problem to the exact matching problem and implicitly involve two steps: identifying exact matches and building inexact alignments supported by exact matches. To find exact matches, these algorithms rely on a certain representation of suffix/prefix trie, such as suffix tree, enhanced suffix array [38] and FM-index [39]. The advantage of using a trie is that alignment to multiple identical copies of a substring in the reference only needs to be done once because these identical copies collapse on a single path in the trie, whereas with a typical hash table index, an alignment must be performed for each copy.

It should be noted that the choice of these data structures is independent of methods for finding inexact matches. An algorithm built upon FM-index,
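The q-gram bound quoted in the q-gram filter section can be checked numerically. The sketch below counts the q-grams shared by a made-up 20 bp query and a copy carrying two substitutions, and compares the count with (w + 1) - (k + 1)q. The sequences, the function names and the choice of counting shared q-grams as a multiset intersection are assumptions for illustration; this is not how SHRiMP or RazerS implement their filters.

```python
from collections import Counter

def qgrams(s, q):
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counted with multiplicity."""
    ca, cb = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    return sum(min(n, cb[g]) for g, n in ca.items())

def qgram_threshold(w, k, q):
    """Lower bound from the q-gram lemma: (w + 1) - (k + 1) * q."""
    return (w + 1) - (k + 1) * q

query = "ACGTTGCAAGGCTTACGATC"                      # hypothetical read, w = 20
# Same sequence with substitutions at positions 5 and 12 (k = 2).
target = query[:5] + "A" + query[6:12] + "A" + query[13:]

w, k, q = len(query), 2, 4
print(qgram_threshold(w, k, q))                     # 21 - 12 = 9
print(shared_qgrams(query, target, q) >= qgram_threshold(w, k, q))
```

Each substitution can destroy at most q of the w - q + 1 q-grams, which is where the bound comes from; a filter built on it discards candidate regions that share fewer q-grams than the threshold.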
[Figure 1, panels A to E: image not reproduced]

Figure 1: Data structures based on a prefix trie. (A) Prefix trie of string AGGAGC where symbol ^ marks the start of the string. The two numbers in each node give the suffix array interval of the substring represented by the node, which is the string concatenation of edge symbols from the node to the root. (B) Compressed prefix trie by contracting nodes with in- and out-degree both being one. (C) Prefix tree by representing the substring on each edge as the interval on the original string. (D) Prefix directed word graph (prefix DAWG) created by collapsing nodes of the prefix trie with identical suffix array interval. (E) Constructing the suffix array and Burrows–Wheeler transform of AGGAGC. The dollar symbol marks the end of the string and is lexicographically smaller than all the other symbols. The suffix array interval of a substring W is the maximal interval in the suffix array with all suffixes in the interval having W as prefix. For example, the suffix array interval of AG is [1, 2]. The two suffixes in the interval are AGC$ and AGGAGC$, starting at position 3 and 0, respectively. They are the only suffixes that have AG as prefix.
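The construction in panel E can be reproduced with a few lines of Python. The sketch below builds the suffix array and Burrows–Wheeler transform of AGGAGC$ naively, finds the suffix array interval of a substring by direct scanning, and then recomputes the same interval with a backward search over the BWT string. The function names are hypothetical, and the quadratic sorting and counting are deliberately simple; real FM-index implementations use sampled occurrence tables rather than repeated counting.

```python
def suffix_array(text):
    """Naive suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text, sa):
    """BWT: the symbol preceding each sorted suffix (position 0 wraps to '$')."""
    return "".join(text[i - 1] for i in sa)

def sa_interval(text, sa, w):
    """Maximal [lo, hi] range of suffix array ranks whose suffixes start with w."""
    ranks = [r for r, i in enumerate(sa) if text.startswith(w, i)]
    return (ranks[0], ranks[-1]) if ranks else None

def backward_search(bwt, w):
    """Same interval computed from the BWT alone, consuming w right to left."""
    # counts[c] = number of symbols in the text that are smaller than c
    counts, total = {}, 0
    for c in sorted(set(bwt)):
        counts[c] = total
        total += bwt.count(c)
    lo, hi = 0, len(bwt)                  # half-open interval [lo, hi)
    for c in reversed(w):
        if c not in counts:
            return None
        lo = counts[c] + bwt[:lo].count(c)
        hi = counts[c] + bwt[:hi].count(c)
        if lo >= hi:
            return None
    return (lo, hi - 1)

text = "AGGAGC$"                          # '$' sorts before A, C, G in ASCII
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
print(sa)                                 # [6, 3, 0, 5, 2, 4, 1]
print(bwt)                                # CG$GGAA
print(sa_interval(text, sa, "AG"))        # (1, 2), as stated in the caption
print(backward_search(bwt, "AG"))         # (1, 2)
print([sa[r] for r in range(1, 3)])       # [3, 0]: AGC$ and AGGAGC$
```

Searching for AGC with `backward_search` visits the intervals [3, 3], [5, 5] and [1, 1] in turn, one per query letter read from right to left.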

for example, would also work with suffix tree index in principle.

Trie, prefix/suffix tree and FM-index
A suffix trie, or simply a trie, is a data structure that stores all the suffixes of a string, enabling fast string matching. To establish the link between a trie and an FM-index, a data structure based on Burrows-Wheeler Transform (BWT) [40], we focus on prefix trie which is the trie of the reverse string. All algorithms on a trie can be seamlessly applied to the corresponding prefix trie.

Figure 1A gives the prefix trie of AGGAGC. Finding all exact matches of a query sequence is equivalent to searching for a path descending from the root where each edge label on the path matches a query letter in the reverse order. If such a path exists, the query is a substring. Given a query AGC, for example, the path matching the query is [0, 6] → [3, 3] → [5, 5] → [1, 1].

The time complexity of determining if a query has an exact match against a trie is linear in the length of the query, independent of the length of the reference sequence. However, a trie takes $O(L^2)$ space where L is the length of the reference. It is impractical to build a trie even for a bacterial genome. Several data structures are proposed to reduce the space. Among these data structures, a suffix tree (Figure 1C) is most widely used. It achieves linear space while allowing linear-time searching. Although it is possible in theory to represent a suffix tree in $L\log_2 L + O(L)$ bits using rank-selection operations [41], even the most space efficient implementation of bioinformatics tools requires 12–17 bytes per nucleotide [42], making it impractical to hold the suffix tree of the human genome in memory.

To solve this problem, Abouelhoda et al. [38] derived an enhanced suffix array that consists of a suffix array and several auxiliary arrays, taking 6.25 bytes per nucleotide. It can be regarded as an implicit representation of suffix tree, and has an identical time complexity to suffix tree in finding exact matches, better than the suffix array originally invented by Manber and Myers [43].

A further improvement on memory is achieved by Ferragina and Manzini [39] who proposed the
FM-index and found that locating a child of a parent node in the prefix trie can be done in constant time using a backwards search on this data structure. Thus the time complexity of finding exact matches with an FM-index is identical to that with a trie. With respect to memory, the FM-index was originally designed as a compressed data structure such that the theoretical index size can be smaller than the original string if the string contains repeats (equivalently, has small entropy). The FM-index is usually not compressed for better performance during alignment since DNA sequences have a small alphabet. The practical memory footprint of an FM-index is typically 0.5–2 bytes per nucleotide, depending on implementations and the parameters in use. The index of the entire human genome only takes 2–8 GB of memory.

It is worth noting that we only focus on the data structures having been used for DNA sequence alignment. There is a large volume of literature in computer science on general theory of string matching, especially on short string matching. Readers are referred to ref. [44] for a more comprehensive review in a wider scope. Nevertheless, traditional string matching algorithms strive for completeness, while many current aligners sacrifice absolute completeness for speed.

Finding inexact matches with a suffix/prefix trie
Of published aligners that can be used for query-reference alignment, MUMmer [42] and OASIS [45] are based on suffix tree, Vmatch [38] and Segemehl [46] on enhanced suffix array, and Bowtie [47], BWA [48], SOAP2 [49], BWT-SW [50] and BWA-SW [51] on FM-index. As explained above, a program built upon one representation of suffix/prefix trie can be easily migrated to another. The FM-index is most widely used mainly due to its small memory footprint.

As to the algorithms for inexact matching, MUMmer and Vmatch anchor the alignment with maximal unique matches (MUMs), maximal matches, maximal repeats or exact matches, and then join these exact matches with gapped alignment. Similarly, Segemehl initiates the alignment with the longest prefix match of each suffix, but it may also enumerate mismatches and gaps at certain positions of the query to reduce false alignments.

OASIS and BWT-SW essentially sample substrings of the reference by a top-down traversal on the trie and align these substrings against the query by dynamic programming. BWA-SW furthers BWT-SW by representing the query as a directed word graph (DAWG) [52], which also enables it to deploy heuristics to accelerate alignment.

Bowtie and BWA also sample short substrings of the reference, but instead of performing dynamic programming, they compare the query and sampled substrings only allowing a few differences. In addition, as they require the entire read to be aligned, the traversal of the trie can be bounded since it is unnecessary to descend deeper in the trie if it can be predicted that doing so leads to an alignment with excessive mismatches and gaps. Alternatively, Bowtie and BWA can be considered to enumerate all combinations of possible mismatches and gaps in the query sequence such that the altered query can be aligned exactly.

ALIGNING NEW SEQUENCING READS
The algorithms reviewed above are general techniques. Depending on the characteristics of the sequencing technologies and their applications, aligners for new sequence reads also implement extra features.

Effect of gapped alignment
Sequence reads from Illumina and SOLiD technologies were initially 25 bp in length. Performing gapped alignment for such short reads is computationally challenging because allowing a gap in this case will slow down most seeding algorithms. Fortunately, the growing read length makes gapped alignment tractable, although this feature still comes at the cost of efficiency. This raises the question about whether gapped alignment is worth doing.

From Figure 2A, it is clear that gapped alignment (curve ‘gap-se’) increases sensitivity by a few percent in comparison to ungapped alignment (curve ‘ungap-se’), but does not reduce alignment errors. To this end, gapped alignment does not seem to be an essential feature. However, gapped alignment plays a far more important role in variant discovery [53, 54]. When gapped alignment is not implemented, a read containing an indel polymorphism may still be mapped to the correct position but with consecutive mismatches towards the underlying location of the indel. These mismatches can be seen on multiple reads mapped to the same locus, which cause most variant callers to call false SNPs. As a
[Figure 2 plots, not reproduced. (A) x-axis: # mapped reads (x10^6); y-axis: # wrongly mapped reads (log scale); curves: gap-pe, gap-se, ungap-se, bwasw-se, novo-pe. (B) x-axis: # called SNPs (x10^3); y-axis: # false SNPs (log scale); curves: gap-pe, gap-se, ungap-se, ungap-se-GATK, bwasw-se, novo-pe.]

Figure 2: Alignment and SNP call accuracy under different configurations of BWA and Novoalign. (A) Number of misplaced reads as a function of the number of mapped reads under different mapping quality cut-offs. Reads (108 bp) were simulated from human genome build36 assuming 0.085% substitution and 0.015% indel mutation rate, and 2% uniform sequencing error rate. (B) Number of wrong SNP calls as a function of the number of called SNPs under different SNP quality cut-offs. Reads (108 bp) were simulated from chr6 of the human genome and mapped back to the whole genome. SNPs are called and filtered by SAMtools. In both figures, ‘novo-pe’ denotes novoalign alignment; the rest correspond to alignments under different configurations of BWA, where ‘gap-pe’ stands for the gapped paired-end (PE) alignment, ‘gap-se’ for gapped single-end (SE) alignment, ‘ungap-se’ for ungapped SE alignment, ‘bwasw-se’ for BWA-SW SE alignment, and ‘ungap-se-GATK’ for alignment cleaned by the GATK realigner.

result, more false SNPs are predicted from ungapped alignment (Figure 2B) and these SNPs cannot be easily filtered out even with the help of sophisticated tools such as the GATK realigner ([Link].com/broad-gatk); all high-quality false SNPs by ‘gap-se’ are also around undetected long indels.

Furthermore, lacking gapped alignment may also lead to false structural variation calls at least for some algorithms. For example, on the simulated data used in Figure 2B, when the aligner in use does ungapped alignment only, an indel polymorphism is seen causing seven reads mapped to a wrong position with high confidence. BreakDancer [55] predicts a high scoring translocation based on the wrong alignments. Effective gapped aligners such as BWA and novoalign ([Link]) do not produce this false translocation. Therefore, gapped alignment is essential to the variant discovery, but how ChIP- and RNA-seq [2] may be affected is an open question.

Role of paired-end and mate-pair mapping
Some sequencing technologies produce read pairs such that the two reads are known to be close to each other in physical chromosomal distance. These reads are called paired-end or mate-pair reads. With this mate-pair information, a repetitive read will be reliably placed if its mate can be placed unambiguously. Alignment errors may be detected and fixed when wrong alignments break the mate-pair requirement. Figure 2A shows that paired-end alignment outperforms single-end alignment in terms of both sensitivity and specificity. The gain of sensitivity is also obvious from SNP discovery (Figure 2B).

In addition, it is worth noting here that although curve ‘novo-pe’ outperforms ‘gap-pe’ in Figure 2A, the accuracy of SNP calls is similar in Figure 2B. This is possibly because the extra alignment errors from ‘gap-pe’ are random and thus contribute little to variant discovery.

Using base quality in alignment
Smith et al. [20] discovered that using base quality scores improves alignment accuracy because knowing the error probability of each base, the aligner may pay lower penalty for an error-prone mismatch. Figure 3 shows that using base quality score halves alignment errors when the quality score is accurate. In practice, however, accurate quality score is not always available from the base calling pipeline. Recalibration of quality score is recommended to make this strategy more effective.

Aligning long sequence reads
Long reads have greater potential than short reads to contain long indels, structural variations and misassemblies in the reference genome. It is essential for a long-read aligner to be permissive about
[Figure 3 plot, not reproduced. x-axis: # mapped reads (x10^3); y-axis: # wrongly mapped reads (log scale); curves: novo-noQual, novo-qual, maq-noQual, maq-qual.]

Figure 3: Alignment accuracy of simulated reads with and without base quality. Paired-end reads (51 bp) are simulated by MAQ from the human genome, assuming 0.085% substitution and 0.015% indel mutation rate. Base quality model is trained from run ERR000589 from the European short read archive. Base quality is not used in alignment for curves with labels ended with ‘-noQual’.

[Figure 4, panels A to E: image not reproduced]

Figure 4: Color-space encoding. (A) Color space encoding matrix. (B) Conversion between base and color sequence. (C) The color encoding of the reverse complement of the base sequence is the reverse of the color sequence. (D) A sequencing error leads to contiguous errors when the color sequence is converted to base sequence. (E) A mutation causes two contiguous color changes.
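The properties listed in the Figure 4 caption can be verified with a short Python sketch. It assumes the usual two-bit formulation of the SOLiD code (A=0, C=1, G=2, T=3, with the color of a dinucleotide given by the XOR of the two base codes) and, for simplicity, treats the first base of the sequence as the known starting base. The function names and the example sequence are made up for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def encode(bases):
    """Color sequence: one color (0..3) per pair of adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(bases, bases[1:])]

def decode(first_base, colors):
    """Rebuild the bases from a known first base and the color sequence."""
    out = [first_base]
    for color in colors:
        out.append(BASES[CODE[out[-1]] ^ color])
    return "".join(out)

def revcomp(bases):
    """Reverse complement of a base sequence."""
    return bases.translate(COMPLEMENT)[::-1]

seq = "ATGGCA"                                  # hypothetical example sequence
colors = encode(seq)
print(colors)                                   # [3, 1, 0, 3, 1]
print(decode("A", colors) == seq)               # panel B: round trip

# Panel C: reverse complementing the bases reverses the colors.
print(encode(revcomp(seq)) == colors[::-1])

# Panel D: one wrong color corrupts every decoded base after it.
bad = colors[:2] + [1] + colors[3:]
print(decode("A", bad))                         # ATGTAC

# Panel E: a single-base mutation changes two adjacent colors.
snp = "ATGTCA"
print([i for i, (x, y) in enumerate(zip(colors, encode(snp))) if x != y])
```

Panels D and E together show why aligning in color space helps: an isolated color change is most likely a sequencing error, while two adjacent, consistent color changes suggest a real base difference.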

alignment gaps and to allow partially aligned read sequences in alignment. At present, all programs capable of genome-wide long-read alignment follow the seed-and-extend paradigm, seeding the alignment using hash table index [8, 9] or more recently FM-index [50, 51], and extending seed matches with the banded Smith–Waterman algorithm. This allows for sensitive detection of indels as well as allowing for partial hits.

Aligning SOLiD reads
The SOLiD sequencing technology observes two adjacent bases simultaneously. Each dinucleotide (16 possibilities) is encoded as one of four possible colors, with the encoding referred to as color space (Figure 4A). Although the known primer base allows for the decoding of a color read to bases (Figure 4B), contiguous errors will arise from a single color sequencing error in this conversion (Figure 4D). Thus algorithms that naively decode a color read will fail. Given the fact that reverse complementing a base sequence is equivalent to reversing the color sequence (Figure 4C), the proper solution is to encode the reference as a color sequence and align color reads directly to the color reference as if they are base sequences with the exception of the complementing rule. After alignment, color sequence can be converted to base sequences with dynamic programming [48].

Performing alignment entirely in the color space may not be optimal, though. With color encoding, one base mutation leads to two contiguous color changes with some restrictions (Figure 4E). Two adjacent consistent color changes are preferred over two discontinuous changes. A better solution is to perform a color-aware Smith–Waterman alignment found in BFAST and SHRiMP [29, 56]. This extension to the standard Smith–Waterman algorithm allows the detection of indels without the aid of post-alignment analysis at the cost of increased computational complexity. Most alignment algorithms described in the previous sections can be applied to SOLiD sequencing reads with few modifications making them color space aware.

Aligning bisulfite-treated reads
Bisulfite sequencing is a technology to identify methylation patterns [3]. From the alignment point of view, unmethylated ‘C’ bases, or cytosines, are converted to ‘T’ (sequences 1 and 4 in Figure 5) and ‘G’ bases complement those cytosines converted to ‘A’ (sequences 2 and 3). Directly aligning converted sequences against the standard reference sequence would be difficult due to the excessive mismatches. Most aligners capable of bisulfite alignments [24, 57] do the following. They create two reference sequences, one with all ‘C’ bases converted to ‘T’ bases (the C-to-T reference) and the other with all ‘G’ bases converted to ‘A’ bases (the G-to-A reference). In alignment, ‘C’ bases are converted to ‘T’ base for reads and are mapped to the
a more comprehensive review on practical issues on


processing RNA-seq data.

Realignment
Reads mapped to the same locus are highly corre-
lated, but all read aligners map a read independent of
Figure 5: Bisulfite sequencing. Cytosines with under- others and thus cannot make use of the correlation
lines are not methylated. Denaturation and bisulfite
between reads or the expected coverage at the same
treatment will convert these cytosines to uracils.
After amplification, four different sequences from the
position. Especially in the presence of indels, not
original double-strand DNA result. using this correlation may lead to a wrong align-
ment around the tail of a read. For indel calling, it
is necessary to perform multi-alignment for reads
C-to-T reference (then a C–T mismatch is effective- mapped to the same locus. Realigner [61] is such a
ly regarded as a match); a similar procedure is per- tool, but originally designed for capillary read align-
formed for the G-to-A conversion in the next round ment. GATK implements a different algorithm for
of alignment. The results from two rounds of align- new sequencing data. Sophisticated indel callers such
as SAMtools [62] also implicitly realign reads around

Downloaded from [Link] by guest on October 19, 2015


ment are combined to generate the final report.
If there are no mutations or sequencing errors, a potential indels.
bisulfite treated read can always be mapped exactly
in one of the two rounds. Alignment software
Over 20 short-read alignment software have been
Aligning spliced reads published in the past 2 years and dozens of more
Transcriptome sequencing, or RNA-seq [2], pro- are still unpublished. The availability of these tools
duces reads from transcribed sequences with introns greatly boosts the development of alignment algo-
and intergenetic regions excluded. When RNA-seq rithms, yet only a few of them are being heavily
reads are aligned against the genomic sequence, a used. Table 1 gives a list of free popular short-read
read may be mapped to a splicing junction, which alignment software packages, based on the ‘tag
will fail with a standard alignment algorithm. It is cloud’ of the SEQanswers forum ([Link]
possible to add sequences around known or pre- [Link]). They all output alignments in the SAM
dicted splicing junctions to the ref. [58] or more format [62], the emerging standard alignment format
cleverly to make the alignment algorithm aware of which is widely supported by alignment viewers
known splicing junctions [24]. Nevertheless, novel such as GBrowse [63], LookSeq [64], Tablet [65],
splicing will not be discovered this way. BamView [66], Gambit ([Link]
QPALMA [59] and TopHat [60] were developed gambit-viewer), IGV ([Link]
to solve this problem. They first align reads to the .org/igv/) and MagicViewer ([Link]
genome using a standard mapping program and .[Link]/magicviewer/), as well as generic variant
identify putative exons from clusters of mapped callers such as SAMtools, GATK ([Link]
reads or from reads mapped into introns at their .com/broad-gatk) VarScan [67] and BreakDancer
last few bases, possibly aided by splicing signals [55].
learnt from real data. In the next round, potential On speed, Bowtie and BWA typically align 7
junctions are enumerated within a certain distance Gbp against the human genome per CPU day.
around the putative exons. The unmapped reads In comparison, the standard Illumina base caller,
are then aligned against the sequences flanking the Bustard, processes 3 Gbp per CPU day and the
possible junctions. Novel junctions can thus be real-time image analysis requires similar amount of
found. However, Trapnell et al. [60] have reported CPU time to base calling (Skelly and Bonfields, per-
that only 72% of splicing sites found by ERANGE sonal communication). Therefore, alignment is no
[58] can be identified by TopHat without using longer the most time consuming step in the entire
known splicing sites (TopHat is able to consider data processing pipeline. Speed improvement will
known splicing), indicating that incorporating not greatly reduce the time spent on data analyses.
known splicing sites in alignment may be necessary For long reads, SSAHA2 and BLAT are still the
to RNA-seq. Readers are also referred to ref. [2] for most popular aligners. However, their alignment
speed is of the order of 0.5 Gbp per CPU day, much slower than short-read aligners. Recently developed algorithms such as Mosaik and BWA-SW are faster and may alleviate this computational bottleneck.

CONCLUSIONS AND FUTURE DEVELOPMENT
Short-read alignment is thought to be the computing bottleneck of the analysis of new sequencing data. Fortunately, the active development of alignment algorithms opens this bottleneck even with the rapidly increasing throughput of sequencing machines. In a couple of years, however, long reads will dominate again and programs developed for short reads will not be applicable; long-read alignment and de novo assembly will become crucial.
In addition, while major sequencing centers have sufficient localized computing resources to analyze data at present, such resources are not available to small research groups, which hampers the application of new sequencing technologies. Even between major centers, data sharing in a large collaborative project such as the 1000 genomes project (http:// [Link]) poses challenges. One possible solution to these problems might be cloud computing, with data uploaded and analyzed in a shared cloud. Several researchers [26, 68] have explored this approach, but establishing a cloud computing framework requires the efforts of the entire community. Furthermore, data transfer bottlenecks and leased storage have yet to be proved cost-effective for cloud computing.
Another trend of development is the simultaneous alignment against multiple genomes. Li et al. [69] have found the presence of extensive novel sequences absent from the human reference genome, which may lead to the loss of information when reads are aligned to a single genome. In the light of large-scale resequencing projects such as the 1000 genomes project, the Drosophila population genomics project ([Link]) and the 1001 genomes project ([Link]), alignment against multiple genomes will become increasingly important. Several groups [70, 71] have pioneered in this direction; the proposal of unifying multi-genome alignment and de novo assembly with an assembly graph (Birney and Durbin, personal communication) is attractive, but how to apply the methods given genome-wide human data is yet to be solved in practice.

Key Points
• The advent of new sequencing technologies paves the way for various biological studies, most of which involve sequence alignment on an unparalleled scale.
• The development of alignment algorithms has been successful and short-read alignment against a single reference is not the bottleneck in data analyses any more.
• With the increasing read lengths produced by the new sequencing technologies, we expect further development in multi-reference alignment, long-read alignment and de novo assembly.

Acknowledgements
We thank the three anonymous reviewers whose comments helped us to improve the manuscript.

FUNDING
H.L. is supported by the NIH grant 1U01HG005208-01 and N.H. by 1U01HG005210-01.

References
1. Dalca AV, Brudno M. Genome variation discovery with high-throughput sequencing data. Brief Bioinform 2010;11:3–14.
2. Pepke S, Wold B, Mortazavi A. Computation for ChIP-seq and RNA-seq studies. Nat Methods 2009;6:S22–32.
3. Cokus SJ, Feng S, Zhang X, et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 2008;452:215–9.
4. Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009;6:S6–12.
5. Simpson JT, Wong K, Jackman SD, et al. ABySS: a parallel assembler for short read sequence data. Genome Res 2009;19:1117–23.
6. Li R, Zhu H, Ruan J, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010;20:265–72.
7. Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet 2010;11:31–46.
8. Ning Z, Cox AJ, Mullikin JC. SSAHA: a fast search method for large DNA databases. Genome Res 2001;11:1725–9.
9. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res 2002;12:656–64.
10. Malhis N, Butterfield YSN, Ester M, et al. Slider–maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics 2009;25:6–13.
11. Malhis N, Jones SJ. High quality SNP calling using Illumina data at shallow coverage. Bioinformatics 2010;26:1029–35.
12. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
13. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.
14. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981;147:195–7.
15. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol 1982;162:705–8.
16. Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics 2002;18:440–5.
17. Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24:713–4.
18. Jiang H, Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 2008;24:2395–6.
19. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:1851–8.
20. Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 2008;9:128.
21. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25:2841–2.
22. Baeza-Yates RA, Perleberg CH. Fast and practical approximate string matching. In: Apostolico A, Crochemore M, Galil Z, Manber U, (eds). CPM, Lecture Notes in Computer Science, Vol. 644. Berlin: Springer, 1992:185–92.
23. Lin H, Zhang Z, Zhang MQ, et al. ZOOM! Zillions of oligos mapped. Bioinformatics 2008;24:2431–7.
24. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26:873–81.
25. Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS One 2009;4:e7767.
26. Schatz M. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009;25:1363–9.
27. Chen Y, Souaiaia T, Chen T. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics 2009;25:2514–21.
28. Clement NL, Snell Q, Clement MJ, et al. The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics 2010;26:38–45.
29. Rumble SM, Lacroute P, Dalca AV, et al. SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol 2009;5:e1000386.
30. Weese D, Emde AK, Rausch T, et al. RazerS–fast read mapping with sensitivity control. Genome Res 2009;19:1646–54.
31. Jokinen P, Ukkonen E. Two algorithms for approximate string matching in static texts. In: MFCS, Lecture Notes in Computer Science, Vol. 520. Berlin: Springer, 1991:240–8.
32. Cao X, Li SC, Tung AKH. Indexing DNA sequences using q-Grams. In: Zhou L, Ooi BC, Meng X, (eds). DASFAA, Lecture Notes in Computer Science, Vol. 3453. Berlin: Springer, 2005:4–16.
33. Burkhardt S, Kärkkäinen J. Better filtering with gapped q-grams. In: Apostolico A, Takeda M, (eds). CPM, Lecture Notes in Computer Science, Vol. 2089. Berlin: Springer, 2001:73–85.
34. Farrar M. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics 2007;23:156–61.
35. Eppstein D, Galil Z, Giancarlo R, Italiano GF. Sparse dynamic programming. In: SODA. Philadelphia: Society for Industrial and Applied Mathematics, 1990;513–22.
36. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005;6:31.
37. Myers EW. An O(ND) Difference algorithm and its variations. Algorithmica 1986;1(2):251–66.
38. Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithms 2004;2:53–86.
39. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000), Redondo Beach, CA, USA. 2000;390–8.
40. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. CA: Palo Alto, 1994.
41. Munro JI, Raman V, Rao SS. Space efficient suffix trees. J Algorithms 2001;39(2):205–22.
42. Kurtz S, Phillippy A, Delcher AL, et al. Versatile and open software for comparing large genomes. Genome Biol 2004;5:R12.
43. Manber U, Myers EW. Suffix arrays: a new method for on-line string searches. SIAM J Comput 1993;22:935–48.
44. Navarro G. A guided tour to approximate string matching. ACM Comput Surv 2001;33:31–88.
45. Meek C, Patel JM, Kasetty S. OASIS: an online and accurate technique for local-alignment searches on biological sequences. In: Proceedings of 29th International Conference on Very Large Data Bases (VLDB 2003), Berlin. 2003;910–21.
46. Hoffmann S, Otto C, Kurtz S, et al. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol 2009;5:e1000502.
47. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25.
48. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
49. Li R, Yu C, Li Y, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009;25:1966–7.
50. Lam TW, Sung WK, Tam SL, et al. Compressed indexing and local alignment of DNA. Bioinformatics 2008;24:791–7.
51. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010;26(5):589–95.
52. Blumer A, Blumer J, Haussler D, et al. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 1985;40:31–55.
53. Ossowski S, Schneeberger K, Clark RM, et al. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 2008;18:2024–33.
54. Krawitz P, Rödelsperger C, Jäger M, et al. Microindel detection in short-read sequence data. Bioinformatics 2010;26:722–9.
55. Chen K, Wallis JW, McLellan MD, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009;6:677–81.
56. Homer N, Merriman B, Nelson SF. Local alignment of two-base encoded DNA sequence. BMC Bioinformatics 2009;10:175.
57. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
58. Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5:621–8.
59. De Bona F, Ossowski S, Schneeberger K, et al. Optimal spliced alignments of short sequence reads. Bioinformatics 2008;24:i174–80.
60. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009;25:1105–11.
61. Anson EL, Myers EW. ReAligner: a program for refining DNA sequence multi-alignments. J Comput Biol 1997;4(3):369–83.
62. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.
63. Stein LD, Mungall C, Shu S, et al. The generic genome browser: a building block for a model organism system database. Genome Res 2002;12(10):1599–610.
64. Manske HM, Kwiatkowski DP. LookSeq: a browser-based viewer for deep sequencing data. Genome Res 2009;19(11):2125–32.
65. Milne I, Bayer M, Cardle L, et al. Tablet–next generation sequence assembly visualization. Bioinformatics 2010;26(3):401–2.
66. Carver T, Bohme U, Otto T, et al. BamView: viewing mapped read alignment data in the context of the reference sequence. Bioinformatics 2010;26(5):676–7.
67. Koboldt DC, Chen K, Wylie T, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 2009;25:2283–5.
68. Langmead B, Schatz MC, Lin J, et al. Searching for SNPs with cloud computing. Genome Biol 2009;10(11):R134.
69. Li R, Li Y, Zheng H, et al. Building the sequence map of the human pan-genome. Nat Biotechnol 2010;28:57–63.
70. Schneeberger K, Hagmann J, Ossowski S, et al. Simultaneous alignment of short reads against multiple genomes. Genome Biol 2009;10:R98.
71. Mäkinen V, Navarro G, Sirén J, et al. Storage and retrieval of individual genomes. In: Batzoglou S (ed). RECOMB, Lecture Notes in Computer Science, Vol. 5541. Berlin: Springer, 2009;121–37.
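
As a supplementary illustration of the two-round bisulfite mapping strategy described in the text (a C-to-T reference and a G-to-A reference, with reads converted in the same way before mapping), the following Python sketch may help. It is a toy under stated assumptions and not the method of any particular aligner: the function names are hypothetical, matching is exact substring search rather than an indexed, mismatch-tolerant alignment, and strand handling (reverse complements) is omitted.

```python
def c_to_t(seq: str) -> str:
    """In-silico bisulfite conversion: replace every C with T."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """Conversion used for the second round: replace every G with A."""
    return seq.upper().replace("G", "A")


def find_all(text: str, pattern: str) -> list:
    """Return the start positions of all exact occurrences of pattern."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def map_bisulfite_read(read: str, reference: str) -> list:
    """Map one read in two rounds and combine the hits.

    Round 1 compares the C-to-T converted read with the C-to-T reference,
    so a C-T difference is effectively treated as a match.
    Round 2 repeats the procedure with the G-to-A conversion.
    Returns a list of (round_name, position) tuples.
    """
    hits = [("C-to-T", p) for p in find_all(c_to_t(reference), c_to_t(read))]
    hits += [("G-to-A", p) for p in find_all(g_to_a(reference), g_to_a(read))]
    return hits


if __name__ == "__main__":
    reference = "ACGTTCGGACCA"
    # Read drawn from reference[2:10] = "GTTCGGAC" with both cytosines
    # unmethylated, hence sequenced as T.
    print(map_bisulfite_read("GTTTGGAT", reference))
```

In this sketch the methylation state is not decided during mapping; it could be inferred afterwards by comparing the unconverted read with the original reference at the reported position.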

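The second-round procedure described under "Aligning spliced reads" (enumerate potential junctions within a certain distance around putative exons, then align unmapped reads against the sequences flanking each candidate junction) can be sketched in the same spirit. This is a minimal illustration and not the QPALMA or TopHat implementation: the exon coordinates are assumed to be given (half-open intervals), the names `max_intron` and `flank` are hypothetical parameters, and exact matching stands in for a real aligner.

```python
def candidate_junctions(exons, max_intron):
    """Pair each putative exon with downstream exons within max_intron bases.

    exons is a list of (start, end) half-open intervals sorted by start.
    """
    pairs = []
    for i, (s1, e1) in enumerate(exons):
        for s2, e2 in exons[i + 1:]:
            if 0 < s2 - e1 <= max_intron:
                pairs.append(((s1, e1), (s2, e2)))
    return pairs


def junction_flanks(genome, donor, acceptor, flank):
    """Return the exonic sequence flanking a candidate junction on each side."""
    left = genome[max(donor[0], donor[1] - flank):donor[1]]
    right = genome[acceptor[0]:min(acceptor[1], acceptor[0] + flank)]
    return left, right


def map_spliced_read(read, genome, exons, max_intron, flank):
    """Report junctions as (donor_end, acceptor_start) that the read spans."""
    hits = []
    for donor, acceptor in candidate_junctions(exons, max_intron):
        left, right = junction_flanks(genome, donor, acceptor, flank)
        joined = left + right
        pos = joined.find(read)
        while pos != -1:
            # Keep the hit only if the read crosses the splice point.
            if pos < len(left) < pos + len(read):
                hits.append((donor[1], acceptor[0]))
                break
            pos = joined.find(read, pos + 1)
    return hits


if __name__ == "__main__":
    exon1, intron, exon2 = "ATGGCCATTG", "GTAAGTTTTTCAG", "GATCCGTACA"
    genome = "TT" + exon1 + intron + exon2 + "CC"
    exons = [(2, 12), (25, 35)]
    read = exon1[-5:] + exon2[:5]  # spans the exon1/exon2 junction
    print(map_spliced_read(read, genome, exons, max_intron=50, flank=8))
```

Because the candidate junctions come only from putative exons found in the first round, a read spanning a junction next to an exon with no first-round coverage would still be missed by this sketch.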