[Haskell-cafe] haskell in online contests
Daniel Fischer
daniel.is.fischer at web.de
Sun Nov 29 03:53:31 EST 2009
Am Samstag 28 November 2009 21:21:20 schrieb vishnu:
> this is where I've gotten to.
> http://moonpatio.com/fastcgi/hpaste.fcgi/view?id=5120#a5120
> strangely enough Ive gotten no speedup at all from the substitution cost
> UArray (though I had to make it Int, Int to deal with digits.).
Converting the characters with letterValue takes time. I'm a little surprised that it
takes so much time, though. I would have expected it to still be faster than Map.
If you make
subArray :: UArray (Char,Char) Int
subArray = array (('0','0'),('z'z')) ...
you avoid the conversion at the price of a larger array. It's still small enough to have
the entire computation data in the cache, so it should be faster.
However, the Chars are converted to Ints for array-indexing anyway (I think Char is
internally represented as a machine integer [wrapped in a constructor], so this is
basically a no-op, even if not, it's going to be much faster than letterValue), so why not
avoid the conversions (except once on reading) completely and work with Ints?
Change all (UArray Int Char) to (UArray Int Int) and let
getArray :: BS.ByteString -> UArray Int Int
getArray xs = listArray (1, fromIntegral (BS.length xs)) (map letterValue $ BS.unpack xs)
replace
substitutionCost (orig ! i) (new ! j)
with
subArray!(orig!i,new!j)
and remove substitutionCost (optional), that's all you need to change in your code -
except: please fix the typo
-- calculating the Leveishtein distance as described here
...................^^^^^^^^^^^^^^^^^^
:)
> But still I wonder if there's something else I missed. Im really curious what lazyness
> you used to go from 60 to 1.6? I always thought lazyness was automatic and
> seq made strictness possible.
What you need is a sufficiently lazy *algorithm* to compute (min 3 $ distance orig new).
For top speed, you must implement that algorithm sufficiently strictly ;)
You might want to read carefully the "Possible improvements" section on WP to get an idea.
I'll try to explain without giving too much away to respect the spirit of the codechef
challenge.
The Levenshtein algorithm for computing the cost of the cheapest editing sequence(s)
transforming start (length m) into target (length n) computes the lowest costs for
transforming initial sequences of start (length i) to initial sequences of target (length
j), i ranging from 0 to m, j from 0 to n, altogether (m+1)*(n+1) costs.
The costs for i == 0 or j == 0 are easily determined and if you calculate the costs in an
appropriate order, calculating each cost is cheap.
We are only interested in whether the cost (distance) is 0, 1, 2 or larger than 2.
So whenever we stray more than two steps from the diagonal, we can stop.
You approximate that behaviour by writing the value 3 to all cells far enough off the
diagonal.
But you're still writing (m+1)*(n+1) values/thunks to the array. Since the actual
calculation of the costs is cheap, you don't win very much (cuts down execution time by a
little more than half - not bad, but much more is possible).
Also, you're always walking down the entire diagonal, even if one can see much sooner that
the cost is larger than 2. Consider
thisends -> herestop
The last letters differ, so the cost is one of
a) 2+cost (thisend -> heresto) -- substitution (s,p)
b) 1+cost (thisends -> heresto) -- insert p
c) 1+cost (thisend -> herestop) -- delete s
a) last letters differ, another branch adding at least 1 to the cost, so after the second
step we know that route leads to a total larger than 2
b) and c) need three steps to ascertain that the total cost exceeds 2
Now for long strings with large Levenshtein distance, this is typical (occasionally you'll
encounter identical letters, but that doesn't take much time since it doesn't involve a
branching), after three levels of branching, you know the cost exceeds 2, no need to go
further.
So a properly lazy algorithm stops processing as soon as it's certain that the distance is
larger than 2.
One way to do it is to calculate the distance using lazy Peano numbers and checking
whether it's larger than 2:
---------------------------------------------------------------------
data Nat
= Zero
| Succ Nat
n2i :: Nat -> Int
n2i (Succ n) = 1 + n2i n
n2i _ = 0
i2n :: Int -> Nat
i2n 0 = Zero
i2n n = Succ (i2n (n-1))
minN :: Nat -> Nat -> Nat
minN (Succ m) (Succ n) = Succ (minN m n)
minN _ _ = Zero
ldistance :: UArray Int Char -> UArray Int Char -> Nat
ldistance orig new = minN (Succ (Succ (Succ Zero))) $ go m n
where
m = snd $ bounds orig
n = snd $ bounds new
go i j
| i == 0 = i2n j
| j == 0 = i2n i
| a == b = go (i-1) (j-1)
| otherwise = let h = costArray!(a,b)
x = case h of
1 -> Succ (go (i-1) (j-1))
2 -> Succ (Succ (go (i-1) (j-1)))
y = Succ (go i (j-1))
z = Succ (go (i-1) j)
in minN x (minN y z)
where
a = orig!i
b = new!j
distance :: UArray Int Char -> UArray Int Char -> Int
distance orig new = n2i $ ldistance orig new
---------------------------------------------------------------------
Performance still sucks (20s), partly because Nat is slow, partly because I intentionally
pessimised (5s without that pessimisation), but it is a sufficiently lazy algorithm
(almost, there's still a way to stop even earlier).
Now find an efficient way to break early (hint: don't use a lazy datatype, use Int).
> thanks
> Vishnu
More information about the Haskell-Cafe
mailing list