[Haskell-cafe] [darcs #222] performance problem with massive changes in a single file

Mon Feb 21 18:13:43 EST 2005

G'day all.

Quoting David Roundy via RT <bugs at darcs.net>:

> Indeed.  The issue here is that darcs uses the exact Hunt-Szymanski LCS
> algorithm when it computes the diff, while GNU diff uses an approximate
> --but faster--algorithm, which works "almost" as well.  The approximate
> algorithm is reasonably well-documented, but more complex to implement
> than the exact one, just because it's harder to understand.

This isn't entirely true.

The Myers algorithm used in GNU diff is exact.  However, it's also
designed to adapt (if allowed) in the case where it seems to be taking
too long.

GNU diff also goes to some trouble to help the Myers algorithm along.
A common case in real-life diffing is that you have a "block" of
adjacent lines in one file, none of which appear in the other file.
In that case, the block can be combined into one "virtual line", which
reduces the effective length of the diff, making most LCS algorithms go
faster.

A problem with the Myers algorithm is that the diff that it produces
is less "pretty" than that produced by most other means.

> This would be a reasonable project for an interested haskell CS student.
>  The LCS is implemented in pure haskell 98, and is of signature
>
> lcs :: Ord a => [a] -> [a] -> [a]

The returned list is proportional to the length of the smaller of the
input lists, which is, in general, quite large compared with the size
of the patch.  It's probably more efficient to produce a list of edits:

data Source = LeftFile | RightFile

diff :: (Ord a) => [a] -> [a] -> [(SourceFile, Int)]

Where each tuple represents a line which has no correspondence in the
other file.

If anyone is curious, I did a pretty complete implementation of diff
for Mercury:

        http://www.mercury.cs.mu.oz.au/

It should be fairly directly translatable into Haskell if someone doesn't
feel like thinking.

Cheers,
Andrew Bromage