© 1990 Oxford University Press. Nucleic Acids Research, Vol. 18, No. 10, 3001

Digital signal processing methods for biosequence comparison

Donald C. Benson
Department of Mathematics, University of California, Davis, CA 95616, USA

Received November 29, 1989; Revised and Accepted April 6, 1990

ABSTRACT
A method is discussed for DNA or protein sequence comparison using a finite field fast Fourier transform, a digital signal processing technique; and statistical methods are discussed for analyzing the output of this algorithm. This method compares two sequences of length N in computing time proportional to N log N compared to N^2 for methods currently used. This method makes it feasible to compare very long sequences. An example is given to show that the method correctly identifies sites of known homology.

INTRODUCTION
Major research effort in recent years in the field of nucleic acid and protein sequencing has resulted in massive databases of nucleic acid and protein sequences. Users of these databases often wish to determine the degree of similarity between pairs of these biosequences. This article is concerned with the application of digital signal processing (DSP) methods, especially fast convolution algorithms and the fast Fourier transform, to the problem of searching biosequence databases for similarity. For an exposition of DSP algorithms see (1). DSP methods are currently applied, for example, in the fields of sonar, radar, seismology, tomography, and computerized photographic image enhancement. We believe that DSP methods are also a rich source of techniques for biosequence analysis. To illustrate the possibilities, this paper puts forth a particular biosequence application.

The most rapid current methods of sequence comparison are very rapid, and there is probably not a generally perceived need among molecular biologists for faster methods. However, this situation is likely to change as the size of the databases increases. It seems likely that either there will be need to screen sequences which are orders of magnitude longer than those currently studied, or there will be a need to do simultaneous screening of a large number of sequences.

DSP methods do not deliver exactly the same kind of information as standard methods. We believe that it is worthwhile to explore what kind of information can be obtained and to develop statistical methods to analyze this information. DSP methods are not capable of simply accelerating the speed of algorithms that are currently used for sequence analysis. There are steps in the currently used algorithms which seem impossible to replicate using DSP methods. For example, DSP methods seem to be unable to distinguish between consecutive and non-consecutive matches. On the other hand, certain information can be obtained much more rapidly by DSP methods than by standard methods. For example, the method described here produces a list of the total number of exact matches in each alignment of two sequences. This abundance of information requires statistical analysis in order to determine which alignments are significant. While we cannot compare DSP methods with standard methods in accomplishing precisely the same tasks, we shall give an example which verifies that the DSP method discussed below identifies the 'correct' alignments of sequences.

Using DSP methods, we have developed an experimental computer program for DNA sequence comparison. We do not have a full-featured, completely documented program suitable for public distribution which we can offer as an alternative for current sequence comparison programs. The current capabilities of the program are too special to serve as a tool for molecular biologists. The program is intended to research the potential of DSP methods. The program compares two DNA sequences, each of length 1024 or less. A DSP method of sequence comparison is inherently faster than current methods because an N log N method inevitably gains a speed advantage over an N^2 method if N, i.e. the length of the sequences, is large enough; however, N = 1024 may not be large enough to demonstrate an advantage over current methods. Presently, our greatest concern is not speed but correctness. We wish to show that it is possible to devise a DSP program that satisfactorily identifies sites of similarity.

The application of the fast Fourier transform to biosequence similarity searches has been previously discussed by Felsenstein et al. in (2). We believe that they underestimate the potential usefulness of this method. They state certain problems which we discuss below.

a. The method cannot detect insertions or deletions. If insertions or deletions are present, more than one alignment exhibits an unusual number of matches. In this article, we introduce statistical methods which answer this objection by developing statistical tests to determine if a significant number of alignments exhibit an unusually large number of matches.

b. The method cannot determine whether matches are consecutive or not. On the other hand, methods which select on the basis of contiguity miss significant alignments in which matches are separated by substitutions. DSP methods have not been researched sufficiently to say that there is no way around this problem.

c. The method involves cumbersome calculations with complex floating-point numbers. This difficulty is avoided by using finite field fast Fourier transforms. This avoids the use of complex numbers. Only integer arithmetic is involved, and a large proportion of the multiplications in the algorithm can be reduced to bit shifts which are much more rapidly executed on the computer than true multiplications.

The problem of searching for similarities in strings of characters is of general interest in the field of computer science. The problem of searching for exact matches is considered in (3) and (4). Methods of biosequence comparison seek a degree of similarity rather than an exact match. See (5) for a review of current methods for biosequence comparison. The most powerful methods, the dynamic programming algorithms, model the process of biological evolution by constructing an optimal series of transformations that carry the one sequence through a chain of intermediate sequences ending finally with the second sequence (6). Each transformation contributes to a similarity score. Dynamic programming methods are capable of uncovering subtle relationships; however, these methods are too slow for large-scale database searches.

Rapid methods for sequence comparison generally count the number of matches for each alignment of the sequences. For example, the dot-matrix method (7) produces a rectangular array of dots; a dot appears in the ith column of the jth row if item i in the first sequence matches item j in the second sequence. Dots on a particular NW to SE diagonal line represent matches in the corresponding alignment of the sequences.

Some rapid methods count matches with a suitable weighting factor if a certain matching criterion is met. The matching criterion, for example, may require that, for some fixed integer k, k contiguous characters of the two sequences are identical (8). A weighting factor may give a higher score to matches that are considered less likely (9). Dynamic programming can be used after an initial screening by means of more rapid methods (9).

In order to use DSP methods, the character sequences are translated into suitable numerical sequences. We obtain similarity information from the numerical sequences by convolution, an arithmetic process. The convolutions are then computed by fast DSP methods. This similarity information is then interpreted by statistical methods.

For comparison of two sequences of length N, the best current methods use algorithms that require, at least in the worst case, computation time proportional to N^2. We present here a method with computation time proportional to N log N. This greatly extends our ability to compare very long sequences.

The key to our method is the use of a variant of the fast Fourier transform (FFT). See (1), especially chapter 6. In this application the standard FFT has the disadvantage of replacing mere character-by-character comparisons with rather complex floating-point arithmetic. This problem is eliminated by doing the arithmetic in a suitable finite field of integers instead of the field of complex numbers. Improvement results because integer arithmetic is faster than floating-point arithmetic and because in a suitably chosen finite field most of the multiplications of the FFT algorithm become bit shifts which are computed much faster than true multiplications.

Although our methods also apply to protein sequences, for simplicity we refer below only to DNA sequences.

Our method can be considered a variant of the dot-matrix method described above. However, the output of our method consists of the total number of matches for each character (A, C, G, and T for DNA sequences) for each alignment of the two sequences (without regard for contiguity). For each alignment, the sum of the total number of matches is equal to the number of dots in the corresponding diagonal of the dot-matrix. For large sequences, this information can be obtained very much faster than the dot-matrix.

Our technique differs from those commonly used in that it does not distinguish between contiguous and noncontiguous matching characters. The advantage is that our method is more sensitive in detecting string transformations by substitution. The disadvantage is that the method is not able to give greater significance to matches when they are contiguous. Our technique shares with all rapid methods a difficulty in detecting string transformations by insertion or deletion. Nevertheless, when an insertion or deletion occurs, similarity may be detected by observing a large number of matches in two or more separate alignments. The overriding advantage of the method is speed. This method is intended as a screening method. Supplemental dynamic programming methods can be used to confirm and elaborate probable sites of similarity.

In the next section, we define the FFT and develop the basic comparison algorithm.

In the last section, we discuss the statistical analysis of the output of the comparison algorithm. In particular, we analyze a test of similarity using extreme values.

THE COMPARISON ALGORITHM
Let

(2.1) $v = (v_j,\ j = 0, \ldots, n-1)$

be a vector of complex or real numbers. The discrete Fourier transform of v is defined to be the complex vector V of length n given by

(2.2) $V_k = v_0 w^{0 \cdot k} + v_1 w^{1 \cdot k} + \cdots + v_{n-1} w^{(n-1) k}, \quad 0 \le k \le n-1$

where w is the primitive nth root of unity, $\exp(-2\pi i/n)$. The number n is called the block-length.

We define the inverse Fourier transform of a vector U to be the vector u given by

(2.3) $u_j = (U_0 w^{-0 \cdot j} + U_1 w^{-1 \cdot j} + \cdots + U_{n-1} w^{-(n-1) j})/n, \quad 0 \le j \le n-1$

The inverse Fourier transform is a true inverse in the sense that for any vectors v and V related according to (2.1) and (2.2), v is equal to the inverse transform of V.

Our application of the Fourier transform uses the following fact, known as the convolution theorem. For any pair of real or complex n-dimensional vectors f and g define their convolution to be the vector e given by

(2.4) $e_j = f_{j-0}\, g_0 + f_{j-1}\, g_1 + \cdots + f_{j-n+1}\, g_{n-1}, \quad 0 \le j \le n-1$

where indices are read cyclically: $f_k$ stands for $f_{k+n}$ if $k < 0$, and for $f_k$ otherwise.

The convolution of f and g is denoted f * g.

Let F and G be vectors of length n. We define F · G to be the vector E given by

(2.5) $E_k = F_k G_k, \quad 0 \le k \le n-1$

The following assertion is called the convolution theorem. Let F and G be the Fourier transforms of f and g respectively. Then the Fourier transform of f * g is F · G. Equivalently, the inverse Fourier transform of F · G is f * g.

For sufficiently large n, use of the convolution theorem is faster than the direct method for computing the convolution of f and g. Note that it is easier to compute F · G than f * g. In fact, F · G requires n multiplications and one addition whereas f * g requires n^2 multiplications and n additions. Moreover, the fast Fourier transform (FFT) and the inverse FFT require of the order of n log n multiplications and additions. More precisely, there are constants M and A not depending on n such that the number of multiplications is less than M n log n for all n, and the number of additions is less than A n log n for all n. The standard way of expressing this fact is to say that the number of multiplications and additions are each O(n log n). As a consequence of the above, for sufficiently large n, it is faster to compute the FFT's F and G, then F · G, and finally the inverse FFT of F · G, than it is to compute f * g directly.

It is advantageous to replace the complex number field in the definitions above with a finite field, the field of integers modulo a prime number p. Let w be a number between 0 and p. The order of w is defined to be the smallest positive integer n such that $w^n \equiv 1 \pmod{p}$. For vectors of size n of integers, we may define a Fourier transform, and an inverse Fourier transform by reinterpreting the formulas (2.1-2.3). All equalities are interpreted as congruence modulo p. We use the w we have just discussed instead of the complex nth root of unity. The convolution theorem holds exactly as before. We obtain exactly the same results computing convolutions using this version of the convolution theorem provided all the numbers involved are nonnegative integers less than p. We shall call this type of Fourier transform a finite field Fourier transform (F3T). There are also fast counterparts of these called finite field fast Fourier transforms (F4T). We have written a computer program for a F4T using p equal to the Fermat prime 2^16 + 1 = 65,537 and n equal to 2^10 = 1024. See (1) pp. 180-182. With apologies for alliterative excess (Fermat finite field fast Fourier transform), we will name this transform F5T. For F5T, most of the multiplications reduce to bit shifts which are much faster than true multiplications. Using an IBM-AT microcomputer our program computes F5T's of four numerical sequences of length 1024 simultaneously in about two seconds.

We now give a method of translating sequences of characters into numerical vectors. We will then show how convolutions of these vectors bear on the problem of sequence comparison. For definiteness we consider the case of nucleic acid sequences which consist of strings of length L ≤ n of the four characters A, C, G, T. Note that it is permissible for the length L of the sequence to be less than the block-length n. Let a be the vector given by

(2.6) $a_j = \begin{cases} 1 & \text{if } j \le L \text{ and the } j\text{th character is an A} \\ 0 & \text{otherwise,} \end{cases} \quad j \le n.$

Note that if the length of the sequence L is less than the block-length n, then a_j = 0 for j such that L < j < n.

Similarly, define the vectors c, g, and t with respect to the incidence of the characters C, G, and T. Let A, C, G, and T be the respective FFT's.

Consider the problem of comparing this sequence with a second sequence of length M. The sequences are compared by totaling the number of matching paired characters under every possible alignment.

First consider the case that L = M = n. When we compare the first sequence with an offset of the second sequence, there are characters at the beginning of the first sequence and at the end of the second sequence which are not paired with any character. For the alignment shown below n is 15 and the number of matches is 4.

(2.7)     ATCACAAGTACCTTA
               TTGTTAACTAACGTA

This type of comparison is called linear comparison. There is another type called cyclic comparison in which we consider the first character of each sequence to be successor of the last character. In other words, in cyclic comparison we consider each of the sequences to be circular. Applying this type of comparison to the above alignment (2.7) adds one more match making a total of 5 instead of 4.

(2.8)     ATCACAAGTACCTTA
          ACGTATTGTTAACTA

In effect, we are lumping two different alignments in the original formulation, namely the alignment (2.7) and

(2.9)               ATCACAAGTACCTTA
          TTGTTAACTAACGTA

If one or both of the sequences is shorter than the block-length, then the sequences must be padded with blanks. Consider altering the above example so that n = 15, L = 9, and M = 7. Blanks are indicated with '*'.

(2.7')    ATCACAAGT******
               TTGTTAA********

For this example, cyclic comparison and linear comparison produce exactly the same number of matches. This is true not only for this particular alignment, but for any alignment of these two sequences, because all of the newly introduced pairs in the cyclic comparison have at least one member which is a blank. (Pairing a blank with a blank does not count as a match.) In general the cyclic comparison is the same as the linear comparison provided that L + M ≤ n + 1. Although cyclic comparison is more suited to the convolution method and has a simpler statistical analysis, linear comparison is more natural for the biosequence application. The convolution method can be used for linear comparisons provided that the above inequality is satisfied.

Suppose L = M = n and that we wish to make a cyclic comparison of the sequence described by the vectors a, c, g, and t, with a second sequence of the same length. Define a', c', g', and t' to be the character vectors for A, C, G, and T, as above, but for the second sequence of characters in reverse order. Note that the jth components (0 ≤ j ≤ n-1) of the vectors

(2.10) A = a * a', C = c * c', G = g * g', T = t * t'

are equal to the number of times an A (respectively C, G, T) in the first sequence matches an A (respectively C, G, T) in the alignment with the second sequence in which the first character of the first sequence is paired with the (n - j - 1)th character of the second sequence. In other words, the convolution vectors consist of the total number of single-character matches. We will
call the convolutions A, C, G, and T match vectors. The FFT is used to compute the match vectors.

These methods can be easily adapted to the computation of weighted averages instead of mere totals. Particularly for protein comparisons, weights may be taken large or small according as the matches are scarce or widespread, and, for non-identical pairs, weights are assigned to evaluate the replacement likelihood. The PAM250 matrix (10) provides weights for protein comparisons.

STATISTICAL ANALYSIS
The match vectors discussed in the previous paragraph can be computed quite rapidly using, for example, F5T. We now discuss the problem of the statistical analysis of the match vectors. In particular, we want to determine if a particular set of match vectors represents a significant similarity between the two sequences from which they were derived.

Statistical tests can be devised to detect similarity. We discuss two types of tests, the first using measures of dispersion, the second using extreme values. A measure of dispersion tests for average similarity over all alignments. Extreme values identify the most significant alignments. Current search techniques generally adopt the second view of similarity.

Mean and Variance
For simplicity we first assume that the two sequences to be compared are of the same length. The mean of a match vector is determined completely by the number of characters in each of the two sequences and the length. In fact, let a and a' be the number of A's in the first and second sequences, respectively, and let N be the common length of the sequences. Note that given a character in the first sequence and a character in the second sequence, there is exactly one alignment in which these two characters are matched. There are N alignments, and the number of AA pairs possible is aa'. Thus the mean number of AA pairs per alignment is

(3.1a) $\mu_A = aa'/N.$

Similarly, the mean number of CC's, GG's, and TT's are, respectively,

(3.1b-d) $\mu_C = cc'/N, \quad \mu_G = gg'/N, \quad \mu_T = tt'/N.$

Every sample has the same mean. This surprising property is possible because the components of the match vector are not statistically independent.

On the other hand, if a measure of dispersion, such as the variance, is large, it means that some alignments have a relatively large number of matches. This justifies taking a measure of dispersion as a measure of similarity of the sequences.

There is more than one possible probability model for non-similarity. Such a model could be based empirically on properties observed for actual biosequences. Consequences of various models of non-similarity are discussed in (11). Here we will base our further development on a simple model which represents a first order approximation. Our model of non-similarity is that sequences with a given proportion of A's, C's, G's, and T's are generated by shuffling randomly the order of the characters while keeping fixed the total number of each character. Under this hypothesis, the statistical distribution of the number of matches for any particular alignment is hypergeometric. (See, for example, (12) pp. 179 and 218.) However, the number of matches for different alignments of the same two sequences are obviously not statistically independent. Nevertheless, the means are given by (3.1a-d), and the variances are given by

(3.2a-d)
$\sigma_A^2 = \dfrac{aa'(N-a)(N-a')}{N^2(N-1)}, \qquad \sigma_C^2 = \dfrac{cc'(N-c)(N-c')}{N^2(N-1)},$
$\sigma_G^2 = \dfrac{gg'(N-g)(N-g')}{N^2(N-1)}, \qquad \sigma_T^2 = \dfrac{tt'(N-t)(N-t')}{N^2(N-1)}.$

In order to compare sample variances of sequences with differing proportions of A, C, G, and T, we use (3.1) and (3.2) to standardize the match vector variables A, C, G, and T.

(3.3a-d)
$A^*_i = (A_i - \mu_A)/\sigma_A$
$C^*_i = (C_i - \mu_C)/\sigma_C$
$G^*_i = (G_i - \mu_G)/\sigma_G$
$T^*_i = (T_i - \mu_T)/\sigma_T$
$(0 \le i \le n-1)$

A complication arises if, as in linear comparison, one or both of the sequences are shorter than the block-length. In that case formula (3.1) needs to be reinterpreted because, for any alignment, matches can occur only in the region in which the sequences overlap. The length N in formula (3.1) is the length of the overlap. Similarly, a, a', c, c', etc., represent the numbers of A's, C's, etc. for each of the sequences in the overlapping region. This means that the variables N, a, a', etc., may have different values for different alignments. This additional information can be obtained using the convolution method. Formulas (3.1), (3.2), and (3.3) have obvious reinterpretations if the sequences are shorter than the block-length.

A significantly high variance of the standardized variables can be taken as an indication of significant similarity. A chi-square test could be used in the case of independent trials. If the non-similarity hypothesis specified that each alignment were shuffled independently, then we would have independence and we could use chi-square. Unfortunately, this hypothesis is inconsistent with the manner in which our data is actually generated. Nevertheless, ignoring this difficulty, we have used chi-square on GenBank sequence data and have found that the values of chi-square correlate with known cases of sequence similarity, but it is difficult to interpret the meaning of chi-square confidence levels.

Screening by Extreme Values
Researchers are currently more interested in pinpointing the few most significant alignments rather than analyzing the average significance of a batch of alignments. In this section we consider a particular screening procedure for similarity which implements this concern. If there are alignments for which the number of matches exceeds a certain critical value, z, then we flag those alignments for further analysis, say using a Needleman-Wunsch algorithm (6). In using the FFT we examine alignments in batches of a certain size which we denote n. We say that a type 1 error occurs if in screening a batch of n alignments we overlook significant matches, and a type 2 error occurs if we flag one or more alignments which are not significant. The probabilities of the two types of errors depend on the choice of critical value, the lengths of the sequences, etc. In this section we analyze the dependence of type 1 and type 2 errors on the various parameters.

We begin with a discussion of type 2 errors. Our analysis depends on extreme value theory. Let Z_1, Z_2, ... be a sequence of independent and identically distributed random variables.
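
The screening step described in this section (count the matches for every alignment in a batch, then flag the alignments whose count exceeds the critical value z) can be sketched in Python. This is a minimal illustration under stated assumptions, not the program discussed in the paper: it uses NumPy's ordinary complex floating-point FFT rather than a finite field transform, it performs cyclic comparison of two sequences of equal length, and the function names `cyclic_match_counts` and `flag_alignments` are hypothetical. Multiplying by the complex conjugate of the second transform stands in for reversing the second sequence before convolving.

```python
import numpy as np


def cyclic_match_counts(seq1: str, seq2: str, alphabet: str = "ACGT") -> np.ndarray:
    """Return counts[s] = number of i with seq1[i] == seq2[(i - s) % n].

    Both sequences must have the same length n (cyclic comparison).
    Characters outside `alphabet` (for example '*' padding) never match.
    """
    n = len(seq1)
    if len(seq2) != n:
        raise ValueError("cyclic comparison needs sequences of equal length")
    total = np.zeros(n)
    for ch in alphabet:
        # 0/1 indicator vectors for one character, in the spirit of (2.6).
        a = np.array([c == ch for c in seq1], dtype=float)
        b = np.array([c == ch for c in seq2], dtype=float)
        # Circular cross-correlation computed with the FFT.
        total += np.fft.ifft(np.fft.fft(a) * np.conj(np.fft.fft(b))).real
    # The true counts are integers; rounding removes floating-point noise.
    return np.rint(total).astype(int)


def flag_alignments(counts: np.ndarray, z: int) -> list[tuple[int, int]]:
    """Return (offset, matches) for every alignment with more than z matches."""
    return [(int(s), int(m)) for s, m in enumerate(counts) if m > z]


# The two length-15 sequences of example (2.7).
counts = cyclic_match_counts("ATCACAAGTACCTTA", "TTGTTAACTAACGTA")
print(counts[5])                   # cyclic alignment (2.8): 5 matches
print(flag_alignments(counts, 4))  # offsets whose match count exceeds 4
```

For sequences this short the critical value 4 is arbitrary and serves only to exercise the flagging step; a meaningful choice of z depends on the error analysis that follows.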

Classical extreme value theory (13) determines the asymptotic distribution of

$M_n = \max(Z_1, Z_2, \ldots, Z_n)$

as n tends to infinity. It has recently been shown (14) that the principal results of this theory remain valid under certain conditions of stationarity and mild dependence. These results have implications for the biosequence comparison problem, at least in the case in which one of the sequences being compared is much longer than the other.

In order to make use of asymptotic results, we must idealize the comparison problem. We suppose that one of the sequences being compared is infinite in length and the other has length M.

In order to study type 2 errors, we test a model of non-similarity in which the underlying biosequences are sequences of independent random variables. Let z_i be the total number of matches for the ith alignment, and let $Z_i = (z_i - \mu)/\sigma$ where $\mu$ and $\sigma^2$ are the mean and variance of z_i; i.e. Z_i is the standardization of z_i. The assumption of stationarity needed for the extension (14) of extreme value theory requires that the joint distributions of

(3.4) $Z_{i_1}, Z_{i_2}, \ldots, Z_{i_k}$

and

$Z_{i_1+m}, Z_{i_2+m}, \ldots, Z_{i_k+m}$

are the same for any choice of the positive integers k, i_1, i_2, ..., i_k, m. This assumption is natural for the application under consideration. We make the approximation that (3.4) has a k-dimensional normal distribution for any choice of k, i_1, i_2, ..., i_k. Since the sequence has length M, it is not unreasonable to assume further that the covariance cov(Z_i, Z_j) is equal to zero for all i and j such that i > j + M. This covariance assumption and the assumption of stationarity imply (13, p. 84) that for any real x, the probability of the inequality

(3.5) $a_n (M_n - b_n) \le x$

tends to $\exp(-e^{-x})$ as $n \to \infty$, where

(3.6) $a_n = (2 \log n)^{1/2}, \qquad b_n = (2 \log n)^{1/2} - \tfrac{1}{2} (2 \log n)^{-1/2} (\log \log n + \log 4\pi).$

The foregoing can be interpreted as giving the asymptotic behavior of the probability of a type 2 error. In fact, if z is the critical value, we simply put $x = a_n (Z - b_n)$ in (3.5).

To illustrate the analysis of type 1 errors, we make much more restrictive assumptions. Assume that there is a 'significant' alignment in which L of the M characters match. We assume that the remaining M - L pairs may be considered independent Bernoulli trials with probability p of a successful match. The distribution of these matches is binomial with mean $\mu = p(M - L)$ and variance $\sigma^2 = p(1 - p)(M - L)$. Suppose that the critical value for the number of matches is z, and $Z = (z - \mu)/\sigma$ is the standardization of z. The type 1 error is the eventuality that the screening will miss the significant alignment of length L, i.e. that the number of matches among the remaining M - L is less than z - L. Approximating the binomial with the normal distribution we have that the probability of the type 1 error is equal to $Q(Z - L/\sigma)$ where Q is the left tail of the normal probability distribution:

$Q(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} \exp(-\tfrac{1}{2} t^2)\, dt.$

Table 1. Type 1 and type 2 errors for sequence of length 256.

Critical   Probabilities of type 1 error       Type 2 error
value      (match length = L)                  (no. of alignments = n)
z          L=30     L=40     L=50     L=60     n=1024   n=4096
 81        0.1991   0.0205   0.0005   0.0000   1.0000   1.0000
 82        0.2447   0.0297   0.0009   0.0000   0.9992   1.0000
 83        0.2954   0.0420   0.0015   0.0000   0.9839   1.0000
 84        0.3505   0.0580   0.0024   0.0000   0.9105   1.0000
 85        0.4089   0.0786   0.0040   0.0000   0.7559   0.9991
 86        0.4694   0.1044   0.0063   0.0001   0.5613   0.9796
 87        0.5306   0.1357   0.0098   0.0001   0.3821   0.8848
 88        0.5911   0.1729   0.0149   0.0003   0.2452   0.6987
 89        0.6495   0.2160   0.0222   0.0005   0.1515   0.4861
 90        0.7046   0.2648   0.0321   0.0009   0.0915   0.3089
 91        0.7553   0.3187   0.0456   0.0015   0.0546   0.1854
 92        0.8009   0.3767   0.0632   0.0025   0.0322   0.1076
 93        0.8410   0.4376   0.0857   0.0042   0.0190   0.0612
 94        0.8754   0.5000   0.1137   0.0067   0.0111   0.0345
 95        0.9042   0.5624   0.1478   0.0105   0.0065   0.0193
 96        0.9278   0.6233   0.1881   0.0160   0.0038   0.0107
 97        0.9466   0.6813   0.2345   0.0239   0.0022   0.0060
 98        0.9614   0.7352   0.2867   0.0348   0.0013   0.0033
 99        0.9726   0.7840   0.3438   0.0495   0.0008   0.0018
100        0.9809   0.8271   0.4046   0.0688   0.0004   0.0010

These results are illustrated in Table 1 which lists type 1 and type 2 errors for p = 1/4, M = 256, and for various choices of L and n. For the type 2 error model we assume $\mu = pM$ and $\sigma^2 = p(1 - p)M$ for all z_i, and we use the asymptotic formulas (3.5) and (3.6) to estimate the probabilities. For example, using a critical value of 91 we incur type 1 and type 2 errors of about 5% in searching for matching subsequences of length 50 of a sequence of length 256 using batches of size 1024.

                         MAX                 ERROR
      BASES OF H         MATCHES   OFFSET    PROBABILITY
 1.       1- 1024          94        91      0.0111
 2.    1001- 2024          92       777      0.0322
 3.    2001- 3024         141       374      1.2 x 10^-13
 4.    3001- 4024          98       209      0.0013
 5.    4001- 5024          88       960      0.2452
 6.    5001- 6024          88       793      0.2452
 7.    6001- 7024         131       136      2.6 x 10^-11
 8.    7001- 8024         131       112      2.6 x 10^-11
 9.    8001- 9024          90       709      0.0915
10.    9001-10024          93       300      0.0190
11.   10001-11024         188       325      1.3 x 10^-24
12.   11001-12024          95        71      0.0065

Table 2. Comparison of M (MUSHBA.ROD 596-851) with H (HUMHBA4.PRI). BASES OF H lists successive subsequences of H. MAX MATCHES is the maximum number of exact matches over all alignments of M with the indicated subsequence of H. OFFSET is the number of bases of offset that defines the alignment with the maximum number of matches. ERROR PROBABILITY is the probability of a type 2 error, i.e. the probability of the listed number of matches assuming Bernoulli trials with p = 1/4. The boldface entries are alignments which exhibit the expected homology between mouse and human alpha globin exon 2. See text.

As an example of this type of comparison, we compare two GenBank sequences. The sequence MUSHBA.ROD (1441 bp) according to GenBank documentation is 'Mouse alpha globin gene, complete cds' and HUMHBA4.PRI (12847 bp) is 'Human alpha globin psi-alpha-1, alpha-2 and alpha-1 genes, complete cds.' The subsequence MUSHBA.ROD 622-826 consists of
alpha globin, exon 2. Enlarging this subsequence slightly, we compare MUSHBA.ROD 596-851, which we denote M, with all of HUMHBA4.PRI, which we denote H. As our computer implementation handles sequences of length at most 1024, we compare M successively with H 1-1024, H 1001-2024, ..., H 11001-12024. GenBank documentation lists, in part, the following subsequences of H:

a. 6915-7119 hba2 alpha globin, exon 2
b. 10726-10930 hba1 alpha globin, exon 2
c. 2697-2881 pseudo-hba1 alpha globin, exon 2.

Evidence of the expected homology to M of these subsequences is shown very clearly in this experiment. We summarize the results in Table 2. For each of the 12 subsequences of H we list the alignment with the greatest number of matches. The alignments are identified by their offsets. The H subsequences 3, 7, 8, and 11 contain all or part of the above variants c, a, a, and b, respectively, of human alpha globin exon 2. Note that the above statistical test shows extremely small probabilities of type 2 errors associated with these subsequences. In other words it is almost certain that these subsequences contain significant matches with exon 2 of mouse alpha globin. Note that variant a of exon 2 overlaps subsequences 7 and 8. This fact does not diminish the effectiveness of the statistic in identifying the homology; both subsequences are flagged with a very low type 2 error. Note that the offsets indicated for subsequences 7 and 8 differ by 24 which is exactly the amount of overlap of the two [...]

[...] subsequence of length 75 with 39 or more matches is only 0.00016. We conclude that (3.7) probably represents a significant match.

ACKNOWLEDGEMENTS
We thank Dr. William J. Black, Stanford University, for helpful suggestions and for providing GenBank DNA sequences. We also thank the referees for valuable suggestions.

REFERENCES
1. Blahut, Richard E. (1985) Fast Algorithms for Digital Signal Processing. Addison-Wesley, Reading, Massachusetts.
2. Felsenstein, J., S. Sawyer, and R. Kochin (1982) Nucleic Acids Research 10, 133-139.
3. Knuth, D. E., J. H. Morris, and V. R. Pratt (1977) SIAM J. Comp. 6:2 (June), 323-350.
4. Boyer, R. S., and J. S. Moore (1977) CACM 20:10 (October), 762-772.
5. Waterman, Michael S. (1984) Bulletin of Mathematical Biology 46, 473-500.
6. Needleman, S. B., and C. D. Wunsch (1970) J. Mol. Biol. 48, 444-453.
7. Maizel, J. V., and R. P. Lenk (1981) Proc. Natl. Acad. Sci. USA 78, 7665-7669.
8. Wilbur, W. J., and D. J. Lipman (1983) Proc. Natl. Acad. Sci. USA 80, 726-730.
9. Lipman, D. J., and W. R. Pearson (1985) Science 227, 1435-1441.
10. Dayhoff, M. (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, MD.
11. Lipman, D. J., W. J. Wilbur, T. F. Smith, and M. S. Waterman (1984) Nucleic Acids Research 12, 215-226.
subsequences. This confirms the fact that, as expected, the 12. Parzen, Emanuel (1960) Modern Probability Theory and its Applications.
John Wiley, New York.
alignments 7 and 8 are actually different portions of the same 13. Gumbel, E. J. (1958) Statistics of Extremes. Columbia University Press,
alignment, namely the alignment which matches human alpha New York.
globin exon 2 variant a with mouse alpha globin exon 2. 14. Leadbetter, M. R. (1978) In Rosenblatt, M. (ed.) Studies in Mathematics,
Table 2 exhibits other error probabilities that are indicative Vol. 18 (Studies in Probability Theory), The Mathematical Association of
America, Washington, D. C., pp. 46-110.
of sites of homology. In fact, all of the probabilities are less than
1/4. In any case, these probabilities are indicative of more subtle
homology. For example, the probable homology in the align-
ment listed for sequence 4 may be due, at least in part, to bases
3949-4023 of HUMHBA4.PRI which match 39/75 (= 52%)
of the corresponding bases in the specified alignment with
MUSHBA.ROD. In the following listing of HUMHBA4.PRI
3949-4023, capital letters indicate matches in the specified align-
ment with MUSHBA.PRI:
(3.7)
3949 CCtGTGCTGC caGCaACtTC tggaAaCgtC CCtGTcCCcg GTgctgaagt
3999 cctGgaaTcC ATGCtgggAA GtTGCa
GenBank documentation does not give any reason to suspect
homology at this site. In order to estimate the probability that
this much matching (39/75 = 52 %) occurred by chance, we can
reuse the same extreme value theory that we used above in
computing the probabilities of type 2 errors. Suppose that we
divide H and M into subsequences of the same length as the
subsequence exhibited above. The number of subsequences of
H is 12847/75 = 171.29 and of M is 256/75 = 3.41. The total
number of possible comparisons is the product of these numbers
which rounds off to 584. If we count the number of matches for
each of the 584 alignments of subsequences, each of length 75,
and if we assume that each alignment of the subsequences consists
of Bernoulli trials with p = 1/4, what is the probability that at
least one of the alignments will contain at least as many matches
(39) as (3.7)? If this probability is low enough, then we conclude
that the matching exhibited in (3.7) did not occur by chance.
Using (3.5) and (3.6), we find that the probability of at least one
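The chance-match question posed for listing (3.7) can also be checked
directly. The sketch below is a hedged illustration, not the extreme value
calculation of (3.5) and (3.6): it sums the exact binomial tail for 39 or
more matches in 75 Bernoulli trials with p = 1/4, and it assumes that the
584 comparisons are independent. The function names are hypothetical, and
because the method differs, the result need not reproduce the figure
reported in the text.

```python
from math import comb, expm1, log1p


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_at_least_one(n_comparisons: int, tail: float) -> float:
    """P(at least one of n independent comparisons reaches the threshold).

    Computed as 1 - (1 - tail)**n via log1p/expm1 to limit rounding error
    when `tail` is very small.
    """
    return -expm1(n_comparisons * log1p(-tail))


# Values taken from the text; independence of comparisons is an assumption.
WINDOW = 75          # length of each subsequence
MATCHES = 39         # matches observed in listing (3.7)
P_MATCH = 0.25       # chance of a match at one position
N_COMPARISONS = 584  # number of subsequence comparisons quoted in the text

tail = binomial_upper_tail(WINDOW, MATCHES, P_MATCH)
p_any = prob_at_least_one(N_COMPARISONS, tail)

print(f"P(>= {MATCHES} matches in one window of {WINDOW}) = {tail:.3g}")
print(f"P(at least one of {N_COMPARISONS} windows)        = {p_any:.3g}")
```

A small value of `p_any` supports the same qualitative conclusion as the
text, namely that this much matching is unlikely to arise by chance under
the Bernoulli model.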