Paper On Digital Signal Processing On Biosequence
ABSTRACT
A method is discussed for DNA or protein sequence comparison using a finite field fast Fourier transform, a digital signal processing technique; and statistical methods are discussed for analyzing the output of this algorithm. This method compares two sequences of length N in computing time proportional to $N \log N$ compared to $N^2$ for methods currently used. This method makes it feasible to compare very long sequences. An example is given to show that the method correctly identifies sites of known homology.

INTRODUCTION

Major research effort in recent years in the field of nucleic acid and protein sequencing has resulted in massive databases of nucleic acid and protein sequences. Users of these databases often wish to determine the degree of similarity between pairs of these biosequences. This article is concerned with the application of digital signal processing (DSP) methods, especially fast convolution algorithms and the fast Fourier transform, to the problem of searching biosequence databases for similarity. For an exposition of DSP algorithms see (1). DSP methods are currently applied, for example, in the fields of sonar, radar, seismology, tomography, and computerized photographic image enhancement. We believe that DSP methods are also a rich source of techniques for biosequence analysis. To illustrate the possibilities, this paper puts forth a particular biosequence application.

The most rapid current methods of sequence comparison are very rapid, and there is probably not a generally perceived need among molecular biologists for faster methods. However, this situation is likely to change as the size of the databases increases. It seems likely that either there will be a need to screen sequences which are orders of magnitude longer than those currently studied, or there will be a need to do simultaneous screening of a large number of sequences.

DSP methods do not deliver exactly the same kind of information as standard methods. We believe that it is worthwhile to explore what kind of information can be obtained and to develop statistical methods to analyze this information. DSP methods are not capable of simply accelerating the speed of algorithms that are currently used for sequence analysis. There are steps in the currently used algorithms which seem impossible to replicate using DSP methods. For example, DSP methods seem to be unable to distinguish between consecutive and non-consecutive matches. On the other hand, certain information can be obtained much more rapidly by DSP methods than by standard methods. For example, the method described here produces a list of the total number of exact matches in each alignment of two sequences. This abundance of information requires statistical analysis in order to determine which alignments are significant. While we cannot compare DSP methods with standard methods in accomplishing precisely the same tasks, we shall give an example which verifies that the DSP method discussed below identifies the 'correct' alignments of sequences.

Using DSP methods, we have developed an experimental computer program for DNA sequence comparison. We do not have a full-featured, completely documented program suitable for public distribution which we can offer as an alternative for current sequence comparison programs. The current capabilities of the program are too special to serve as a tool for molecular biologists. The program is intended to research the potential of DSP methods. The program compares two DNA sequences, each of length 1024 or less. A DSP method of sequence comparison is inherently faster than current methods because an $N \log N$ method inevitably gains a speed advantage over an $N^2$ method if N, i.e. the length of the sequences, is large enough; however, N = 1024 may not be large enough to demonstrate an advantage over current methods. Presently, our greatest concern is not speed but correctness. We wish to show that it is possible to devise a DSP program that satisfactorily identifies sites of similarity.

The application of the fast Fourier transform to biosequence similarity searches has been previously discussed by Felsenstein et al. in (2). We believe that they underestimate the potential usefulness of this method. They state certain problems which we discuss below.

a. The method cannot detect insertions or deletions. If insertions or deletions are present, more than one alignment exhibits an unusual number of matches. In this article, we introduce statistical methods which answer this objection by developing statistical tests to determine if a significant number of alignments exhibit an unusually large number of matches.

b. The method cannot determine whether matches are consecutive or not. On the other hand, methods which select on the basis of contiguity miss significant alignments in which matches are separated by substitutions. DSP methods have not been researched sufficiently to say that there is no way around this problem.
3002 Nucleic Acids Research, Vol. 18, No. 10
c. The method involves cumbersome calculations with complex floating-point numbers. This difficulty is avoided by using finite field fast Fourier transforms. This avoids the use of complex numbers. Only integer arithmetic is involved, and a large proportion of the multiplications in the algorithm can be reduced to bit shifts which are much more rapidly executed on the computer than true multiplications.

The problem of searching for similarities in strings of characters is of general interest in the field of computer science. The problem of searching for exact matches is considered in (3) and (4). Methods of biosequence comparison seek a degree of similarity rather than an exact match. See (5) for a review of current methods for biosequence comparison. The most powerful methods, the dynamic programming algorithms, model the process of biological evolution by constructing an optimal series of transformations that carry the one sequence through a chain of intermediate sequences ending finally with the second sequence (6). Each transformation contributes to a similarity score. Dynamic programming methods are capable of uncovering subtle relationships; however, these methods are too slow for large-scale database searches.

Rapid methods for sequence comparison generally count the number of matches for each alignment of the sequences. For example, the dot-matrix method (7) produces a rectangular array of dots; a dot appears in the ith column of the jth row if item i in the first sequence matches item j in the second sequence. Dots on a particular NW to SE diagonal line represent matches in the corresponding alignment of the sequences.

Some rapid methods count matches with a suitable weighting factor if a certain matching criterion is met. The matching criterion, for example, may require that, for some fixed integer k, k contiguous characters of the two sequences are identical (8). A weighting factor may give a higher score to matches that are considered less likely (9). Dynamic programming can be used after an initial screening by means of more rapid methods (9).

In order to use DSP methods, the character sequences are translated into suitable numerical sequences. We obtain similarity information from the numerical sequences by convolution, an arithmetic process. The convolutions are then computed by fast DSP methods. This similarity information is then interpreted by statistical methods.

For comparison of two sequences of length N, the best current methods use algorithms that require, at least in the worst case, computation time proportional to $N^2$. We present here a method with computation time proportional to $N \log N$. This greatly extends our ability to compare very long sequences.

The key to our method is the use of a variant of the fast Fourier transform (FFT). See (1), especially chapter 6. In this application the standard FFT has the disadvantage of replacing mere character-by-character comparisons with rather complex floating-point arithmetic. This problem is eliminated by doing the arithmetic in a suitable finite field of integers instead of the field of complex numbers. Improvement results because integer arithmetic is faster than floating-point arithmetic and because in ...

... C, G, and T for DNA sequences) for each alignment of the two sequences (without regard for contiguity). For each alignment, the sum of the total number of matches is equal to the number of dots in the corresponding diagonal of the dot-matrix. For large sequences, this information can be obtained very much faster than by the dot-matrix method.

Our technique differs from those commonly used in that it does not distinguish between contiguous and noncontiguous matching characters. The advantage is that our method is more sensitive in detecting string transformations by substitution. The disadvantage is that the method is not able to give greater significance to matches when they are contiguous. Our technique shares with all rapid methods a difficulty in detecting string transformations by insertion or deletion. Nevertheless, when an insertion or deletion occurs, similarity may be detected by observing a large number of matches in two or more separate alignments. The overriding advantage of the method is speed. This method is intended as a screening method. Supplemental dynamic programming methods can be used to confirm and elaborate probable sites of similarity.

In the next section, we define the FFT and develop the basic comparison algorithm.

In the last section, we discuss the statistical analysis of the output of the comparison algorithm. In particular, we analyze a test of similarity using extreme values.

THE COMPARISON ALGORITHM

Let

(2.1)  $v = \{ v_j,\ j = 0, \ldots, n-1 \}$

be a vector of complex or real numbers. The discrete Fourier transform of v is defined to be the complex vector V of length n given by

(2.2)  $V_k = v_0 w^{0 \cdot k} + v_1 w^{1 \cdot k} + \cdots + v_{n-1} w^{(n-1) k}, \quad 0 \le k \le n-1$

where w is the primitive nth root of unity, $\exp(-2\pi i/n)$. The number n is called the block-length.

We define the inverse Fourier transform of a vector U to be the vector u given by

(2.3)  $u_j = \left( U_0 w^{-0 \cdot j} + U_1 w^{-1 \cdot j} + \cdots + U_{n-1} w^{-(n-1) j} \right) / n, \quad 0 \le j \le n-1$

The inverse Fourier transform is a true inverse in the sense that for any vectors v and V related according to (2.1) and (2.2), v is equal to the inverse transform of V.

Our application of the Fourier transform uses the following fact, known as the convolution theorem. For any pair of real or complex n-dimensional vectors f and g define their convolution to be the vector e given by

(2.4)  $e_j = f_j g_0 + f_{j-1} g_1 + \cdots + f_{j-n+1} g_{n-1}, \quad 0 \le j \le n-1$
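The definitions of the discrete Fourier transform, its inverse, and the convolution can be checked numerically on small vectors. The following is a minimal sketch in Python, not the authors' program: the function names `dft`, `inverse_dft`, `cyclic_convolution` and `close` are hypothetical, the arithmetic is ordinary complex floating point rather than the finite field arithmetic the paper advocates, and the subscripts of f in (2.4) are assumed to wrap modulo n. The transforms are computed directly from their defining sums, so this sketch takes time proportional to $n^2$ and illustrates only the definitions, not the fast algorithm.

```python
import cmath


def dft(v):
    """Discrete Fourier transform as in (2.2), with w = exp(-2*pi*i/n)."""
    n = len(v)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(v[j] * w ** (j * k) for j in range(n)) for k in range(n)]


def inverse_dft(U):
    """Inverse transform as in (2.3): negated exponents, divided by n."""
    n = len(U)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(U[k] * w ** (-k * j) for k in range(n)) / n for j in range(n)]


def cyclic_convolution(f, g):
    """Convolution as in (2.4); ASSUMPTION: indices of f wrap modulo n."""
    n = len(f)
    return [sum(f[(j - m) % n] * g[m] for m in range(n)) for j in range(n)]


def close(a, b, tol=1e-9):
    """Compare two vectors entry by entry, up to rounding error."""
    return len(a) == len(b) and all(abs(x - y) < tol for x, y in zip(a, b))


# Small illustrative vectors (arbitrary values, block-length n = 4).
v = [1, 2, 3, 4]
f = [1, 0, 2, 0]
g = [0, 1, 1, 0]

# The inverse transform undoes the forward transform.
round_trip = inverse_dft(dft(v))

# Convolution computed directly from (2.4) ...
direct = cyclic_convolution(f, g)

# ... and through the transforms: multiply term by term, then invert.
via_transform = inverse_dft([F * G for F, G in zip(dft(f), dft(g))])

print(close(round_trip, v))
print(direct)
print(close(via_transform, direct))
```

With a fast transform in place of the direct sums in `dft` and `inverse_dft`, the route through the transforms is what would reduce the cost of the convolution from order $n^2$ to order $n \log n$.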
---