[Haskell-cafe] Re: Haskell version of Norvig's Python Spelling Corrector

Bryan O'Sullivan bos at serpentine.com
Sun Apr 22 12:38:27 EDT 2007

Ketil Malde wrote:

> Hm - nobody suggested using ByteStrings yet?

I wrote an independent port of Norvig's spellchecker, because I figured 
it would make the basis of an interesting tutorial.  For such a small 
program, it's been quite a challenge!

I started out using lazy ByteStrings, Data.Map, and Data.Set.  As Albert 
observed, using Data.Set is poison for heap size and performance.  The 
result of switching from sets to lists was a > 90% reduction in memory 
usage, and a big (but unmeasured) speedup.

After this switch, I found that spellchecking one word still took 4x as 
long in Haskell as Norvig's Python program.  Since I was checking only 
one word in each case, essentially all of the runtime was taken up by 
building the word frequency map.

In my profile results, I find that simply converting words to lower case 
accounts for a whopping 40% of time and allocation (see the attachment 
for my definition of the train function).

COST CENTRE                    MODULE               %time %alloc

lower                          Spell                 40.5   41.2
train                          Spell                 26.3   14.3
mkWords                        Spell                 21.9   24.1

I was interested in building a profile-enabled version of fps to see 
what might be going on inside the library, but was stymied by not 
knowing how to get GHC 6.6's base to coexist peacefully with fps (hiding 
base isn't very practical).

My experiments are available here:

darcs get http://darcs.serpentine.com/spell

Norvig's training data is available from http://norvig.com/big.txt

-------------- next part --------------
A non-text attachment was scrubbed...
Name: train.hs
Type: text/x-haskell
Size: 583 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20070422/fee153f0/train.bin

More information about the Haskell-Cafe mailing list