[Haskell-cafe] Re: Haskell version of Norvig's Python Spelling Corrector

Pete Kazmier pete-expires-20070615 at kazmier.com
Sun Apr 22 11:51:02 EDT 2007

Ketil Malde <ketil at ii.uib.no> writes:

> On Sun, 2007-04-22 at 00:25 -0400, Pete Kazmier wrote:
>> Pete Kazmier <pete-expires-20070615 at kazmier.com> writes:
>> > I'd love to see other Haskell implementations as well if anyone has a
>> > few moments to spare.  Admittedly, it took me several hours to get my
>> > version working, but I'm a Haskell newbie.  Unfortunately, I think it
>> > runs as slow as it took me to write it!
> Hm - nobody suggested using ByteStrings yet?  String is notoriously
> wasteful, and replacing it with ByteString normally gives a quite
> significant speedup.

I actually have a ByteString version but it runs much slower.  This
part of the code is where all of the time is spent in the ByteString

  type WordFreq   = M.Map B.ByteString Int
  train:: [B.ByteString] -> WordFreq
  train words = frequencyMap
        frequencyMap     = foldr incWordCount M.empty words
        incWordCount w m = M.insertWith (+) w 1 m
> Worse - and this is true for ByteStrings, too - String comparisons are
> O(n), which means lookups in Sets and Maps are expensive.  A trie (i.e,
> a search tree where each internal node corresponds to a word prefix, and
> has one branch per letter in the alphabet) will give you lookup that
> scales with word size (and possibly alphabet size).

Right.  My first version was just a direct translation of Norvig's
code with an emphasis on trying to keep the complexity and size of
code to a minimum.

> Instead of generating the (huge) list of misspelled words, you could
> calculate edit distance to each correctly spelled word?  With a bound on
> allowable mistakes, this is a linear time operation using the standard
> dynamic programming algorithm.

Could you provide additional information on this "standard dynamic
programming algorithm?"  I'm not familiar with dynamic programming.


More information about the Haskell-Cafe mailing list