[Haskell-cafe] Re: Haskell version of Norvig's Python Spelling Corrector

Sun Apr 22 16:10:44 EDT 2007

On Sun, 2007-04-22 at 11:51 -0400, Pete Kazmier wrote:

>   type WordFreq   = M.Map B.ByteString Int
>  
>   train:: [B.ByteString] -> WordFreq
>   train words = frequencyMap
>       where
>         frequencyMap     = foldr incWordCount M.empty words
>         incWordCount w m = M.insertWith (+) w 1 m

Isn't this going to build unevaluate thunks for the counts?  This is
what frequency counting looks like in some of my old code:

    -- IM is Data.IntMap
    index ss = foldl' myupdate IM.empty ss
         where myupdate m k = let x = 1 + IM.findWithDefault 0 k m 
                              in x `seq` IM.insert k x m

(edited a bit, but hopefully correct)

I.e. use foldl', and look up previous value, succ and seq it, and put it
back.

Generally, I only seem to want strict Maps.  It seems so easy to build a
lazy one if you need it (data Lazy a = Lazy a), so I wonder if strict
should be the default?

> > Instead of generating the (huge) list of misspelled words, you could
> > calculate edit distance to each correctly spelled word?  With a bound on
> > allowable mistakes, this is a linear time operation using the standard
> > dynamic programming algorithm.
> 
> Could you provide additional information on this "standard dynamic
> programming algorithm?"  I'm not familiar with dynamic programming.

Sorry!  Dynamic programming just means using memoizing on a recursive
function to reduce its run time complexity (although I don't think I've
seen that definition anywhere?).  A simple example is finding the n'th
Fibonacci number by calculating (and referring to) the whole list.

For edit distance, you can do something like:

	data Edit a = Repl a a | Delete a | Insert a

	align (x:xs) (y:ys) = best [ Repl x y : align xs ys,
	                             Insert y : align (x:xs) ys,
                                     Delete x : align xs (y:ys)]

Where 'best' is some fitness function, selecting the best alignment.

Applying this directly gives an exponential algorithm, but since there
are only n*m (n = lenght (x:xs), m = length (y:ys) possible ways to pick
the remainders in the recursive calls throughout the execution, you
might as well store them all in an n*m matrix - which only takes O(nm)
time.

This is probably the worst explanation of DP so far this century, but if
you Google for it, you'll find tons of information.  Wikipedia has an
entry as well.

Anyway: since the number of edits in this case is bounded (to four,
since transposition can be modelled by an insertion and a deletion), you
know the optimal solution will never stray further than two off the
diagonal of the DP matrix, so you only need to caclulate a diagonal band
(linear space/time), not the full matrix (quadratic).

-k