An Algorithm For Differential File Comparison
J. W. Hunt
Department of Electrical Engineering, Stanford University, Stanford, California
M. D. McIlroy
Bell Laboratories, Murray Hill, New Jersey 07974
ABSTRACT
The program diff reports differences between two files, expressed as a minimal list of line changes to bring either file into agreement with the other. Diff has been engineered to make efficient use of time and space on typical inputs that arise in vetting version-to-version changes in computer-maintained or computer-generated documents. Time and space usage are observed to vary about as the sum of the file lengths on real data, although they are known to vary as the product of the file lengths in the worst case.

The central algorithm of diff solves the ‘longest common subsequence problem’ to find the lines that do not change between files. Practical efficiency is gained by attending only to certain critical ‘candidate’ matches between the files, the breaking of which would shorten the longest subsequence common to some pair of initial segments of the two files. Various techniques of hashing, presorting into equivalence classes, merging by binary search, and dynamic storage allocation are used to obtain good performance.
[This document was scanned from Bell Laboratories Computing Science Technical
Report #41, dated July 1976. Text was converted by OCR and hand-edited. Figures were
reconstructed. Some OCR errors may remain, especially in tables and equations. Please
report them to [email protected].]
1. Introduction

The program diff creates a list of what lines of one file have to be changed to bring it into agreement with a second file, or vice versa. It is based on ideas from several sources[1,2,7,8]. As an example of its work, consider the two files, listed horizontally for brevity:
a b c d e f g
w a b x y z e
It is easy to see that the first file can be made into the second by the following prescription, in which an
imaginary line 0 is understood at the beginning of each:
append after line 0: w,
change lines 3 through 4, which were: c d
into: x y z,
delete lines 6 through 7, which were: f g.
Going the other way, the first file can be made from the second this way:
delete line 1, which was: w,
change lines 4 through 6, which were: x y z
into: c d,
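Prescriptions like these can be generated mechanically. As a quick illustration (a sketch only: Python's standard difflib uses a longest-matching-block heuristic rather than diff's longest-common-subsequence algorithm, though the two agree on this example):

```python
from difflib import SequenceMatcher

# The two example "files", one line per list element.
old = ["a", "b", "c", "d", "e", "f", "g"]
new = ["w", "a", "b", "x", "y", "z", "e"]

# get_opcodes() yields (tag, i1, i2, j1, j2) editing commands that
# turn old into new; "equal" runs are the lines held in common.
for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
    if tag != "equal":
        print(tag, old[i1:i2], "->", new[j1:j2])
```

Run on these files it reports the same three changes as the first prescription above: an insertion of w, a replacement of c d by x y z, and a deletion of f g.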
    P i0 = P 0j = 0,                    i = 0, . . . , m,  j = 0, . . . , n,

    P ij = 1 + P i−1, j−1               if Ai = B j
           max(P i−1, j , P i, j−1)     if Ai ≠ B j,       1 ≤ i ≤ m, 1 ≤ j ≤ n.
Then P mn is the length of the desired longest common subsequence. From the whole P ij array that was generated in calculating P mn it is easy to recover the indices or the elements of a longest common subsequence.
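The recurrence translates directly into code. A straightforward sketch (O(mn) time and space, illustrating the definition rather than diff's actual method; the function names are ours):

```python
def lcs_table(A, B):
    """Build the table of the recurrence: P[i][j] is the length of the
    longest common subsequence of A[:i] and B[:j]."""
    m, n = len(A), len(B)
    P = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                P[i][j] = 1 + P[i - 1][j - 1]
            else:
                P[i][j] = max(P[i - 1][j], P[i][j - 1])
    return P

def recover_lcs(A, B, P):
    """Walk back through the whole table to recover one longest
    common subsequence, as described above."""
    out, i, j = [], len(A), len(B)
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:
            out.append(A[i - 1])
            i, j = i - 1, j - 1
        elif P[i - 1][j] >= P[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

P = lcs_table("abcabba", "cbabac")
print(P[7][6], "".join(recover_lcs("abcabba", "cbabac", P)))
```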
Unfortunately the dynamic program is O(mn) in time, and—even worse—O(mn) in space. Noting
that each row P i of the difference equation is simply determined from P i−1 , D. S. Hirschberg invented a
clever scheme that first calculates P m in O(n) space and then recovers the sequence using no more space
and about as much time again as is needed to find P m [6].
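Hirschberg's length computation keeps only the current and previous rows of P. A sketch of that O(n)-space pass (the divide-and-conquer recovery of the subsequence itself, the second half of his scheme, is omitted here):

```python
def lcs_length_linear_space(A, B):
    """Compute P[m][n] keeping just two rows of the table: O(n) space."""
    prev = [0] * (len(B) + 1)        # row P[i-1]
    for a in A:
        cur = [0]                    # row P[i], built left to right
        for j, b in enumerate(B, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

print(lcs_length_linear_space("abcabba", "cbabac"))  # prints 4
```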
The diff algorithm improves on the simple dynamic program by attending only to essential matches,
the breaking of which would change P. The essential matches, dubbed ‘k-candidates’ by Hirschberg[7],
occur where Ai = B j and P ij > max(P i−1, j, P i, j−1). A k-candidate is a pair of indices (i, j) such that (1)
Ai = B j , (2) a longest common subsequence of length k exists between the first i elements of the first file
and the first j elements of the second, and (3) no common subsequence of length k exists when either i or j
is reduced. A candidate is a pair of indices that is a k-candidate for some k. Evidently a longest common
subsequence can be found among a complete list of candidates.
If (i1, j1) and (i2, j2) with i1 < i2 are both k-candidates, then j1 > j2 . For if j1 = j2 , (i2, j2) would vio-
late condition (3) of the definition; and if j1 < j2 then the common subsequence of length k ending with
(i1, j1) could be extended to a common subsequence of length k + 1 ending with (i2, j2).
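The defining test for candidates can be checked directly against the full P table. A brute-force enumeration sketch (for illustration only; diff builds its candidate list without ever forming the full table):

```python
def candidates(A, B):
    """List (i, j, k) triples where (i, j) is a k-candidate:
    A[i-1] == B[j-1] and the match is essential, i.e.
    P[i][j] > max(P[i-1][j], P[i][j-1])."""
    m, n = len(A), len(B)
    P = [[0] * (n + 1) for _ in range(m + 1)]
    out = []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                P[i][j] = 1 + P[i - 1][j - 1]
                if P[i][j] > max(P[i - 1][j], P[i][j - 1]):
                    out.append((i, j, P[i][j]))
            else:
                P[i][j] = max(P[i - 1][j], P[i][j - 1])
    return out

for i, j, k in candidates("abcabba", "cbabac"):
    print(k, (i, j))
```

As the lemma requires, within each k the listed j values strictly decrease as i increases.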
The candidate methods have a simple graphical interpretation. In Figure 1 dots mark grid points (i, j) for which Ai = B j. Because the dots portray an equivalence relation, any two horizontal lines or any two vertical lines on the figure have either no dots in common or carry exactly the same dots. A common subsequence is a set of dots that can be threaded by a strictly monotone increasing curve. Four such curves have been drawn in the figure. These particular curves have been chosen to thread only (and all) dots that are candidates. The values of k for these candidates are indicated by transecting curves of constant k. These latter curves, shown dashed, must all decrease monotonically. The number of candidates is obviously less than mn, except in trivial cases, and in practical file comparison turns out to be very much less, so the list of candidates usually can be stored quite comfortably.
[Figure 1: dots mark the grid points (i, j) with Ai = B j; four strictly increasing curves thread the candidate dots, and dashed curves of constant k = 1, 2, 3, 4 transect them.]

Figure 1. Common subsequences and candidates in comparing abcabba with cbabac.
3. Hashing
To make comparison of reasonably large files (thousands of lines) possible in random access memory, diff hashes each line into one computer word. This may cause some unequal lines to compare equal.
Assuming the hash function is truly random, the probability of a spurious equality on a given comparison
that should have turned out unequal is 1/M, where the hash values range from 1 to M. A longest common
subsequence of length k determined from hash values can thus be expected to contain about k/M spurious
matches when k << M, so a sequence of length k will be a spurious ‘jackpot’ sequence with probability
about k/M. On our 16-bit machine jackpots on 5000-line files should happen less than 10% of the time and
on 500-line files less than 1% of the time.
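The arithmetic behind those two figures, assuming hash values are uniform over the 2^16 values of a 16-bit word:

```python
# A purported common subsequence of k lines, found via hashing,
# is a spurious 'jackpot' sequence with probability about k/M.
M = 2 ** 16                 # 16-bit hash words
for k in (500, 5000):
    print(k, "lines: jackpot probability about", k / M)
```

This prints roughly 0.008 for 500 lines and 0.076 for 5000, matching the ‘less than 1%’ and ‘less than 10%’ figures above.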
Diff guards against jackpots by checking the purported longest common subsequence in the original
files. What remains after spurious equalities are edited out is accepted as an answer even though there is a
small possibility that it is not actually a longest common subsequence. Diff announces jackpots, so these
cases tend to get scrutinized fairly hard. In two years we have had brought to our attention only one jackpot
where an edited longest subsequence was actually short—in that instance short by one.
4. Complexity
In the worst case, the diff algorithm doesn’t perform substantially better than the trivial dynamic program. From Section 2 it follows that the worst case time complexity is dominated by the merging and is in
fact O(mn log m) (although O(m(m + n)) could be achieved). Worst case space complexity is dominated by
the space required for the candidate list, which is O(mn) as can be seen by counting the candidates that
arise in comparing the two files
a b c a b c a b c ...
a c b a c b a c b ...
This problem is illustrated in Figure 2. When m = n the kite-shaped area in which the candidates lie is 1/2 the total area of the diagram, and (asymptotically) 1/3 of the grid points in the kite are candidates, so the number of candidates approaches n²/6 asymptotically.*
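The quadratic growth for this adversarial pair can be confirmed by brute-force counting (count_candidates is our illustrative helper, not part of diff):

```python
def count_candidates(A, B):
    """Count pairs (i, j) with A[i-1] == B[j-1] that are candidates,
    i.e. P[i][j] > max(P[i-1][j], P[i][j-1])."""
    m, n = len(A), len(B)
    P = [[0] * (n + 1) for _ in range(m + 1)]
    count = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                P[i][j] = 1 + P[i - 1][j - 1]
                count += P[i][j] > max(P[i - 1][j], P[i][j - 1])
            else:
                P[i][j] = max(P[i - 1][j], P[i][j - 1])
    return count

# Prefixes of the adversarial pair abcabc... / acbacb...; the last
# column is 6c/n^2, which should drift toward 1 as n grows.
for n in (30, 90, 270):
    c = count_candidates(("abc" * n)[:n], ("acb" * n)[:n])
    print(n, c, round(6 * c / n ** 2, 3))
```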
In practice, diff works much better than the worst case bounds would indicate. Only rarely are more
than min(m, n) candidates found. In fact an early version with a naive storage allocation algorithm that provided space for just n candidates first overflowed only after two months of use, during which time it was
probably run more than a hundred times. Thus we have good evidence that in a very large percentage of
practical cases diff requires only linear space.
References
[1] S. C. Johnson, ‘ALTER − A Comdeck Comparing Program,’ Bell Laboratories internal memorandum 1971.
[2] Generalizing from a special case solved by T. G. Szymanski[8], H. S. Stone proposed and J. W. Hunt refined and implemented the first version of the candidate-listing algorithm used by diff and embedded it in an older framework due to M. D. McIlroy. A variant of this algorithm was also elaborated by Szymanski[10]. We have had many useful discussions with A. V. Aho and J. D. Ullman. M. E. Lesk moved the program from UNIX to OS/360.
[3] ‘Tutorial Introduction to QED Text Editor,’ Murray Hill Computing Center MHCC-002.
[4] S. B. Needleman and C. D. Wunsch, ‘A General Method Applicable to the Search for Similarities in the Amino Acid Sequence,’ J. Mol. Biol. 48 (1970) 443-53.
[5] D. Sankoff, ‘Matching Sequences Under Deletion/Insertion Constraints,’ Proc. Nat. Acad. Sci. USA 69 (1972) 4-6.
[6] D. S. Hirschberg, ‘A Linear Space Algorithm for Computing Maximal Common Subsequences,’ CACM 18 (1975) 341-3.
[7] D. S. Hirschberg, ‘The Longest Common Subsequence Problem,’ Doctoral Thesis, Princeton 1975.
[8] T. G. Szymanski, ‘A Special Case of the Maximal Common Subsequence Problem,’ Computer Science Lab TR-170, Princeton University 1975.
[9] M. L. Fredman, ‘On Computing the Length of Longest Increasing Subsequences,’ Discrete Math 11 (1975) 29-35.
[10] T. G. Szymanski, ‘A Note on the Maximal Common Subsequence Problem,’ submitted for publication. [The paper finally appeared as H. W. Hunt III and T. G. Szymanski, ‘A fast algorithm for computing longest common subsequences,’ CACM 20 (1977) 350-353.]
[11] The programs called proof, written by E. N. Pinson and M. E. Lesk for UNIX and GECOS, use the heuristic algorithm for differential file comparison.
Appendix
k ← 0.
K[1] is a fence beyond the last usefully filled element.
6. For i = 1, . . . , m, if P[i] ≠ 0 do merge(K, k, i, E, P[i]) to update K and k (see below).
Steps 7 and 8 get a more convenient representation for the longest common subsequence.
7. For i = 1, . . . , m, set J[i] ← 0.
8. For each element c of the chain of candidates referred to by K[k] and linked by previous references, set
J[c.a] ← c.b.
The nonzero elements of J now pick out a longest common subsequence, possibly including spurious
‘jackpot’ coincidences. The pairings between the two files are given by
{(i, J[i]) | J[i] ≠ 0}.
The next step weeds out jackpots.
9. For i = 1, . . . , m, if J[i] ≠ 0 and line i in file 1 is not equal to line J[i] in file 2, set
J[i] ← 0.
This step requires one synchronized pass through both files.
The procedure merge(K, k, i, E, p) follows.
1. Set r ← 0 and c ← K[0].
(By handling the equivalence class in reverse order, Szymanski[10] circumvents the need to delay
updating K[r], but generates extra ‘candidates’ that waste space.)
2. Do steps 3 through 6 repeatedly.
3. Let j = E[p].serial.
Search K[r: k] for an element K[s] such that K[s] → b < j and K[s + 1] → b ≥ j. (Note that
K is ordered on K[·] → b, so binary search will work.)
If such an element is found, do steps 4 and 5.
4. If K[s + 1] → b > j, simultaneously set
K[r] ← c,
r ← s + 1,
c ← candidate(i, j, K[s]).
5. If s = k, simultaneously set
K[k + 2] ← K[k + 1],
k ← k + 1,
and break out of step 2’s loop.
6. If E[p].last = true, break out of step 2’s loop.
Otherwise set p ← p + 1.
7. Set K[r] ← c.
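As a cross-check on the steps above, here is a compact reconstruction of the candidate-merging idea in the reverse-order (Szymanski) variant noted in step 1; the data layout differs from the paper's (K here holds one candidate per length, refined or extended by binary search on the b values), and the function name is ours:

```python
import bisect
from collections import namedtuple

# A candidate pairs line a of file 1 with line b of file 2 and chains
# back (via prev) to a candidate for the next shorter subsequence.
Cand = namedtuple("Cand", "a b prev")

def hunt_mcilroy_lcs(A, B):
    # Equivalence classes: for each line value, its positions in B.
    places = {}
    for j, b in enumerate(B, 1):
        places.setdefault(b, []).append(j)
    # K[s] is the most recently found s-candidate; K[s].b increases with s.
    K = [Cand(0, 0, None)]                   # dummy 0-candidate
    for i, a in enumerate(A, 1):
        # Reverse order within the class: an entry updated for a larger j
        # can never serve as the predecessor of a smaller j at the same i.
        for j in reversed(places.get(a, [])):
            # Binary search for s with K[s].b < j <= K[s+1].b.
            s = bisect.bisect_left([c.b for c in K], j) - 1
            if s + 1 == len(K):
                K.append(Cand(i, j, K[s]))   # a longer subsequence found
            elif K[s + 1].b > j:
                K[s + 1] = Cand(i, j, K[s])  # a better (s+1)-candidate
    # Walk the chain from the longest candidate back to the dummy.
    out, c = [], K[-1]
    while c.prev is not None:
        out.append(A[c.a - 1])
        c = c.prev
    return out[::-1]

print("".join(hunt_mcilroy_lcs("abcabba", "cbabac")))
```

Rebuilding the list of b values at each probe is done only to keep the sketch short; the paper binary-searches K in place.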