[Haskell-cafe] Re: Haskell version of Norvig's Python Spelling
Corrector
Bryan O'Sullivan
bos at serpentine.com
Sun Apr 22 12:38:27 EDT 2007
Ketil Malde wrote:
> Hm - nobody suggested using ByteStrings yet?
I wrote an independent port of Norvig's spellchecker, because I figured
it would make the basis of an interesting tutorial. For such a small
program, it's been quite a challenge!
I started out using lazy ByteStrings, Data.Map, and Data.Set. As Albert
observed, using Data.Set is poison for heap size and performance. The
result of switching from sets to lists was a > 90% reduction in memory
usage, and a big (but unmeasured) speedup.
After this switch, I found that spellchecking one word still took 4x as
long in Haskell as Norvig's Python program. Since I was checking only
one word in each case, essentially all of the runtime was taken up by
building the word frequency map.
In my profile results, I find that simply converting words to lower case
accounts for a whopping 40% of time and allocation (see the attachment
for my definition of the train function).
COST CENTRE MODULE %time %alloc
lower Spell 40.5 41.2
train Spell 26.3 14.3
mkWords Spell 21.9 24.1
I was interested in building a profile-enabled version of fps to see
what might be going on inside the library, but was stymied by not
knowing how to get GHC 6.6's base to coexist peacefully with fps (hiding
base isn't very practical).
My experiments are available here:
darcs get http://darcs.serpentine.com/spell
Norvig's training data is available from http://norvig.com/big.txt
<b
-------------- next part --------------
A non-text attachment was scrubbed...
Name: train.hs
Type: text/x-haskell
Size: 583 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20070422/fee153f0/train.bin
More information about the Haskell-Cafe
mailing list