[Haskell-cafe] Re: Memoizing longest-common-subsequence
Bayley, Alistair
Alistair_Bayley at invescoperpetual.co.uk
Mon Aug 21 05:44:45 EDT 2006
> From: Jared Updike [mailto:jupdike at gmail.com]
>
> Ian's LCS code looks like it works fine in terms of space usage, but
> for large input (lengrh as == 40000, length bs == 50000) it seems to
> be way too slow in terms of time complexity (GNU sdiff executes on
> this same data in a few seconds, the Hunt-Szymanski LCS algorithm
> didn't terminate, even after like 10 minutes).
>
> If it's not too much trouble, could you send Myers STArray based code?
>
> Thanks,
> Jared.
(Sorry for the late reply; have been on holiday.)
Included below. I used to have a unit test module, and also a pure
DiffArray-based implementation for performance comparison (it was
sloooow), but I lost them when my laptop died (a harsh lesson in
infrequent backups), so all I have left is the implementation module
now.
I've used it to diff fairly large files (hundreds of K's, if not Megs)
where there were few differences. It seemed to perform OK, and in cases
where GNU diff (or whatever comes with MSYS) failed.
Alistair
module Diff (diff, diffArray, diffSTArray, Modification(..)) where
{-
See:
http://www.eecs.berkeley.edu/~gene/Papers/np_diff.pdf
http://www.cs.arizona.edu/~gene/PAPERS/np_diff.ps
(An O(NP) sequence comparison algorithm)
The algorithm in the paper is:
compare(src, dst)
M := length src; N := length dst; delta := N - M;
fp[-(M+1)..(N+1)] := -1;
p := -1;
repeat
p++;
for k := -p to delta-1 do
fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1]));
for k := delta+p downto delta+1 by -1 do
fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1]));
k := delta;
fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1]));
until fp[delta] = N
print "edit distance " + (delta + 2*p)
end;
snake(k, y)
x := y - k;
while x < M and y < N and src[x+1] = B[y+1] do
x := x + 1; y := y + 1;
return y;
end;
Notes:
1. In the paper, y refers to the column position,
and x is the row number.
This is not conventional! i.e. x and y are swapped.
In this implementation we use x and y in the conventional
manner, and store the x value in the vertex array.
3. The algorithm as stated in the paper does not indicate
how to determine when a move down or right occurs.
I've assumed it's any update to a diagonal,
where the new x or y value is within the dst or src bounds.
4. We need to set vtx[0] = 0. This handles the case where the
destination string is empty.
In this case, the first move should be a delete i.e. a move down.
If vtx[0] = -1 then the nextEdit function would choose
a move right to diagonal 0 (this would put us further along
the diagonal than a move down), but we only want to move down,
so this is the wrong move. Hence we set vtx[0] = 0.
Note that the algorithm appears to work for all other cases
(including empty src and non-empty dst) when vtx[0] is
either 0 or -1.
All other entries of vtx[] are initialised to -1.
5. The algorithm in the paper only handles the case where
length dst >= length src
i.e. where delta >= 0.
How do we modify it to handle delta < 0?
The rules for generating digaonals for delta >= 0 are:
[-p up-to d-1] (below delta)
[p+d downto d+1] (above delta)
[d]
And the rules for delta < 0:
[p downto d+1] (above delta)
[-d-p up-to d-1] (below delta)
[d]
These are derived from the "recurrence relation" which is:
x[k,p] =
k < delta | snake( max( x[k-1,p ] + 1, x[k+1,p-1] ) )
k > delta | snake( max( x[k-1,p ] + 1, x[k+1,p ] ) )
k = delta | snake( max( x[k-1,p-1] + 1, x[k+1,p ] ) )
Look at patterns for examining diagonals:
delta = 0:
0 | p=0: 0 p=1: -1,1,0 p=2: -2,-1,2,1,0 p=3: -3,-2,-1,3,2,1,0 p=4:
-4,...
delta > 0:
1 | p=0: 0,1 p=1: -1,0,2,1 p=2: -2,-1,0,3,2,1 p=3: -3,-2,-1,0,4,3,2,1
p=3: -4,...
2 | p=0: 0,1,2 p=1: -1,0,1,3,2 p=2: -2,-1,0,1,4,3,2 p=3: -3,...
First iteration goes from 0 to delta (outside in).
Let's see what pattern is like for delta < 0.
delta < 0
-1| p=0: 0,-1 p=1: 1,0,-2,-1 p=2: 2,1,0,-3,-2,-1
-2| p=0: 0,-1,-2 p=1: 1,0,-2,-1 p=2: 2,1,0,-3,-2,-1
-}
import Data.Array.ST as STA
import Data.Array
import Control.Monad.ST
data Modification a = Ins !Int !a | Del !Int !a
deriving (Show, Eq)
inBounds arr i = case STA.bounds arr of (l, u) -> i >= l && i <= u
upperBound arr = case STA.bounds arr of (_, u) -> u
isUpperBound arr i = i == upperBound arr
-- list interface
diff :: Eq a => [a] -> [a] -> [Modification a]
diff [] [] = []
diff src dst = runST ( do
srca <- makeArray1 src
dsta <- makeArray1 dst
diffSTArray srca dsta )
-- Array interface
diffArray :: Eq a => Array Int a -> Array Int a -> [Modification a]
diffArray src dst = runST ( do
-- We can use unsafeThaw because we don't modify src or dst
srca <- unsafeThaw src
dsta <- unsafeThaw dst
diffSTArray srca dsta )
-- STArray interface
diffSTArray :: Eq a => STArray s Int a -> STArray s Int a -> ST s
[Modification a]
diffSTArray src dst = do
let
boundLower = 1 + upperBound src
boundUpper = 1 + upperBound dst
delta = boundUpper - boundLower
vtx <- makeArray boundLower boundUpper (-1)
writeArray vtx 0 0
edt <- makeArray boundLower boundUpper []
diffLoop src dst vtx edt 0 delta
-- Need to help type-checker with array types
makeArray :: Int -> Int -> a -> ST s (STArray s Int a)
makeArray lower upper val = newArray (-lower, upper) val
makeArray1 :: [a] -> ST s (STArray s Int a)
makeArray1 str = newListArray (1, length str) str
diffLoop src dst vtx edt p delta = do
updateForP src dst vtx edt p delta
fin <- finished dst vtx delta
if fin
then do
l <- readArray edt delta
return (reverse l)
else diffLoop src dst vtx edt (p+1) delta
finished dst vtx delta = do
x <- readArray vtx delta
return (isUpperBound dst x)
makeDiagsIncr from to = [from..to]
makeDiagsDecr from to = reverse [to..from]
-- process diagonals within p-band in certain order:
-- 1. those below diagonal delta, from outside in
-- 2. those above diagonal delta, from outside in
-- 3. diagonal delta
updateForP src dst vtx edt p delta =
if delta >= 0
then do
forDiagonals src dst vtx edt (makeDiagsIncr (-p) (delta-1))
forDiagonals src dst vtx edt (makeDiagsDecr (p+delta) (delta+1))
forDiagonals src dst vtx edt [delta]
else do
forDiagonals src dst vtx edt (makeDiagsDecr p (delta+1))
forDiagonals src dst vtx edt (makeDiagsIncr (-delta-p) (delta-1))
forDiagonals src dst vtx edt [delta]
forDiagonals src dst vtx edits [] = return ()
forDiagonals src dst vtx edits (d:diags) = do
nextEdit src dst vtx edits d
forDiagonals src dst vtx edits diags
-- Move down is delete, move right is insert.
-- Note that the position reported for Insert is relative to the
-- src (y) axis, not the dst axis.
-- This is because we expect to report the edit script as changes
-- relative to the src sequence.
-- type-sig is a big help with space usage - squashes unnecessary
polymorphism.
nextEdit :: Eq a => STArray s Int a -> STArray s Int a -> STArray s Int
Int -> STArray s Int [Modification a] -> Int -> ST s ()
nextEdit src dst vtx edits diag = do
mr <- readArray vtx (diag-1)
moveDown <- readArray vtx (diag+1)
let
moveRight = 1 + mr
x = if moveDown >= moveRight then moveDown else moveRight
y = x - diag
newX <- slideDownDiagonal src dst x diag
writeArray vtx diag newX
if moveDown >= moveRight
then do
editList <- makeEdit src y Del y edits (diag+1)
writeArray edits diag editList
else do
editList <- makeEdit dst x Ins (y+1) edits (diag-1)
writeArray edits diag editList
makeEdit :: Eq a => STArray s Int a -> Int -> (Int -> a -> Modification
a) -> Int -> STArray s Int [Modification a] -> Int -> ST s [Modification
a]
makeEdit str i cons y edits diag = do
es <- readArray edits diag
if inBounds str i
then do
e <- readArray str i
return ((cons y e):es)
else return es
slideDownDiagonal :: Eq a => STArray s Int a -> STArray s Int a -> Int
-> Int -> ST s Int
slideDownDiagonal src dst x' dia = do
let x = x' + 1; y = x - dia
if (inBounds dst x) && (inBounds src y)
then do
a <- readArray dst x; b <- readArray src y
if a == b
then slideDownDiagonal src dst x dia
else return x'
else return x'
*****************************************************************
Confidentiality Note: The information contained in this message,
and any attachments, may contain confidential and/or privileged
material. It is intended solely for the person(s) or entity to
which it is addressed. Any review, retransmission, dissemination,
or taking of any action in reliance upon this information by
persons or entities other than the intended recipient(s) is
prohibited. If you received this in error, please contact the
sender and delete the material from any computer.
*****************************************************************
More information about the Haskell-Cafe
mailing list