[Haskell-cafe] haskell in online contests

Sun Nov 29 03:53:31 EST 2009

Am Samstag 28 November 2009 21:21:20 schrieb vishnu:
> this is where I've gotten to.
> http://moonpatio.com/fastcgi/hpaste.fcgi/view?id=5120#a5120
> strangely enough Ive gotten no speedup at all from the substitution cost
> UArray (though I had to make it Int, Int to deal with digits.).

Converting the characters with letterValue takes time. I'm a little surprised that it 
takes so much time, though. I would have expected it to still be faster than Map.

If you make 

subArray :: UArray (Char,Char) Int
subArray = array (('0','0'),('z'z')) ...

you avoid the conversion at the price of a larger array. It's still small enough to have 
the entire computation data in the cache, so it should be faster.

However, the Chars are converted to Ints for array-indexing anyway (I think Char is 
internally represented as a machine integer [wrapped in a constructor], so this is 
basically a no-op, even if not, it's going to be much faster than letterValue), so why not 
avoid the conversions (except once on reading) completely and work with Ints?

Change all (UArray Int Char) to (UArray Int Int) and let

getArray :: BS.ByteString -> UArray Int Int
getArray xs = listArray (1, fromIntegral (BS.length xs)) (map letterValue $ BS.unpack xs)

replace

substitutionCost (orig ! i) (new ! j)

with

subArray!(orig!i,new!j)

and remove substitutionCost (optional), that's all you need to change in your code - 
except: please fix the typo

-- calculating the Leveishtein distance as described here 
...................^^^^^^^^^^^^^^^^^^ 

:)

> But still I wonder if there's something else I missed. Im really curious what lazyness
> you used to go from 60 to 1.6? I always thought lazyness was automatic and
> seq made strictness possible.

What you need is a sufficiently lazy *algorithm* to compute (min 3 $ distance orig new).
For top speed, you must implement that algorithm sufficiently strictly ;)
You might want to read carefully the "Possible improvements" section on WP to get an idea.

I'll try to explain without giving too much away to respect the spirit of the codechef 
challenge.

The Levenshtein algorithm for computing the cost of the cheapest editing sequence(s) 
transforming start (length m) into target (length n) computes the lowest costs for 
transforming initial sequences of start (length i) to initial sequences of target (length 
j), i ranging from 0 to m, j from 0 to n, altogether (m+1)*(n+1) costs.
The costs for i == 0 or j == 0 are easily determined and if you calculate the costs in an 
appropriate order, calculating each cost is cheap.

We are only interested in whether the cost (distance) is 0, 1, 2 or larger than 2.
So whenever we stray more than two steps from the diagonal, we can stop.
You approximate that behaviour by writing the value 3 to all cells far enough off the 
diagonal.
But you're still writing (m+1)*(n+1) values/thunks to the array. Since the actual 
calculation of the costs is cheap, you don't win very much (cuts down execution time by a 
little more than half - not bad, but much more is possible).

Also, you're always walking down the entire diagonal, even if one can see much sooner that 
the cost is larger than 2. Consider

thisends -> herestop

The last letters differ, so the cost is one of
a) 2+cost (thisend -> heresto)    -- substitution (s,p)
b) 1+cost (thisends -> heresto)   -- insert p
c) 1+cost (thisend -> herestop)   -- delete s

a) last letters differ, another branch adding at least 1 to the cost, so after the second 
step we know that route leads to a total larger than 2
b) and c) need three steps to ascertain that the total cost exceeds 2

Now for long strings with large Levenshtein distance, this is typical (occasionally you'll 
encounter identical letters, but that doesn't take much time since it doesn't involve a 
branching), after three levels of branching, you know the cost exceeds 2, no need to go 
further.

So a properly lazy algorithm stops processing as soon as it's certain that the distance is 
larger than 2.

One way to do it is to calculate the distance using lazy Peano numbers and checking 
whether it's larger than 2:

---------------------------------------------------------------------

data Nat
    = Zero
    | Succ Nat

n2i :: Nat -> Int
n2i (Succ n) = 1 + n2i n
n2i _ = 0

i2n :: Int -> Nat
i2n 0 = Zero
i2n n = Succ (i2n (n-1))

minN :: Nat -> Nat -> Nat
minN (Succ m) (Succ n) = Succ (minN m n)
minN _ _ = Zero

ldistance :: UArray Int Char -> UArray Int Char -> Nat
ldistance orig new = minN (Succ (Succ (Succ Zero))) $ go m n
      where
        m = snd $ bounds orig
        n = snd $ bounds new
        go i j
            | i == 0    = i2n j
            | j == 0    = i2n i
            | a == b    = go (i-1) (j-1)
            | otherwise = let h = costArray!(a,b)
                              x = case h of
                                    1 -> Succ (go (i-1) (j-1))
                                    2 -> Succ (Succ (go (i-1) (j-1)))
                              y = Succ (go i (j-1))
                              z = Succ (go (i-1) j)
                          in minN x (minN y z)
              where
                a = orig!i
                b = new!j

distance :: UArray Int Char -> UArray Int Char -> Int
distance orig new = n2i $ ldistance orig new

---------------------------------------------------------------------

Performance still sucks (20s), partly because Nat is slow, partly because I intentionally 
pessimised (5s without that pessimisation), but it is a sufficiently lazy algorithm 
(almost, there's still a way to stop even earlier).
Now find an efficient way to break early (hint: don't use a lazy datatype, use Int).

> thanks
> Vishnu