[Haskell-cafe] Spelling checker exercise

Fri Jan 22 01:51:27 EST 2010

Hello all,

sorry to bring up an old chestnut, but I’m trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig’s spelling checker as my exercise material (http://norvig.com/spell-correct.html).

While I’ve gotten it working, and found it quite illuminating, I also found it to to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I’ve gotten out of it so far is that:

  (a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
  (b) I should probably be using ByteString’s
  (c) Using Set’s for the edit permutations probably isn’t worth it (although I found using plain lists made it about a second slower)

(b) is difficult because I’ve used matching patterns plus list comprehensions to generate the potential edits, and I really like how elegantly it pans out that way. Because ByteString’s are not lists, I can’t see a way to keep the current structure and use them.

The code is at [3] (link to version at time of post). Profiling [4] shows:

  $ ./spelling becuase +RTS -p 
  becuase -> because
  $ cat spelling.prof
    total time =  4.02 secs (201 ticks @ 20 ms) 
    total alloc = 1,544,257,792 bytes (excludes profiling overheads) 

  COST CENTRE                    MODULE               %time %alloc

  train                          Main                  52.7   19.7
  readFile                       Main                  28.9    8.6
  wordsBy                        Main                  10.9   49.5
  toLower                        Main                   7.0   21.8
  ...

So it appears that “train" (building the freq map) and “readFile” in “nwords" are the places to hone. I will look at using Bloom Filters or Trie’s instead of Data.Map, but I wonder if readFile should be taking nearly %30 of the run time, even for a 6MB file?

Sorry to dump such a long post on the list — I’ll understand if no one can be bothered rehashing this. But, in summary I’d like to know:

  (a) how could I use ByteString’s for this to speed up I/O and reduce memory usage without losing the nice readability?

  (b) should readFile be so slow?

  (c) any other tips

Possibly all my questions could be answered if someone has the code from the old posts.

Cheers,

Matthew.

[1]: http://haskell.markmail.org/search/?q=norvig%20spelling#query:norvig%20spelling+page:1+mid:jrvpyjpms5dumd3k+state:results

[2]: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/21780

[3]: http://github.com/scramjet/spelling/blob/44fd336ef4f62d49c7087dbc1ffb9c009b78cec0/spelling.hs

[4]: http://github.com/scramjet/spelling/blob/f5146b24b77443c22975ff69fb91aa922cc1d4a3/spelling.prof