[Haskell-cafe] Spelling checker exercise

Matthew Phillips mattphil at gmail.com
Sun Jan 24 00:19:58 EST 2010


Thanks very much, Daniel, for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.

On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:

> Am Freitag 22 Januar 2010 07:51:27 schrieb Matthew Phillips:
>> Hello all,
>> 
>> sorry to bring up an old chestnut, but I’m trying to improve my
>> Haskell-fu by writing a small program, and chose Peter Norvig’s spelling
>> checker as my exercise material (http://norvig.com/spell-correct.html).
>> 
>> While I’ve gotten it working, and found it quite illuminating, I also
>> found it to be very slow. And then I discovered that, of course,
>> others have been here well before me ([1] and [2]). Those discussions
>> were very interesting, but the code they refer to is mostly not
>> available, so the most I’ve gotten out of it so far is that:
>> 
>>  (a) I should be using strict left folds and strict Map insertions for
>> the word frequency map (this shaved off about a second: ~5s -> ~4s for a
>> single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
>>  (b) I should probably be using ByteStrings
> 
> That does help, but the worst part is building the map. That takes a couple 
> of seconds in Python, too. Just building the map takes 1.95s for Python, 
> 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 
> 6s with plain Strings here.

I get the Python version running in about 1s, compared to 5.6s for the Haskell:

$ time python spelling.py 
because

real	0m1.071s
user	0m0.821s
sys	0m0.139s

$ time ./spelling becuase
becuase -> because

real	0m5.589s
user	0m4.554s
sys	0m0.307s

And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:

$ time ./spelling_df becuase
becuase -> because

real	0m8.087s
user	0m6.966s
sys	0m0.193s

$ time ./spelling korrekt
korrekt -> correct

real	0m5.970s
user	0m4.885s
sys	0m0.300s

$ time ./spelling_df korrekt
korrekt -> correct

real	0m8.616s
user	0m7.538s
sys	0m0.187s
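
(For reference, the change in (a) amounts to building the frequency map with a strict left fold and strict insertions, roughly as in the sketch below; the actual code is at [3], so the names here may differ slightly.)

  import qualified Data.ByteString.Char8 as B
  import qualified Data.Map as Map
  import Data.List (foldl')

  -- build the word-frequency map strictly: foldl' forces the accumulator
  -- at each step, and insertWith' forces each updated count, so no
  -- thunks pile up behind the map
  train :: [B.ByteString] -> Map.Map B.ByteString Int
  train = foldl' updateMap Map.empty
    where updateMap model word = Map.insertWith' (+) word 1 model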

>> The code is at [3] (link to version at time of post). Profiling [4]

>> shows:
>> 
>>  $ ./spelling becuase +RTS -p
>>  becuase -> because
>>  $ cat spelling.prof
>>    total time =  4.02 secs (201 ticks @ 20 ms)
>>    total alloc = 1,544,257,792 bytes (excludes profiling overheads)
>> 
>>  COST CENTRE                    MODULE               %time %alloc
>> 
>>  train                          Main                  52.7   19.7
>>  readFile                       Main                  28.9    8.6
>>  wordsBy                        Main                  10.9   49.5
>>  toLower                        Main                   7.0   21.8
>>  ...
>> 
>> So it appears that "train" (building the freq map) and "readFile" in
>> "nwords" are the places to hone.
> 
> readFile does not appear in my profile.

Apologies, I should have said that I’d inserted some SCCs to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).

> If you insert an SCC for updateMap,
> 
> where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 
> model
> 
> , you'll see that the really bad citizen is updateMap (splitWords is rather 
> bad, too, together they take some 95% of the time in that profile).

Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of runtime. Adding SCCs like this:

  splitWords = {-# SCC "filter" #-} filter (not . B.null)
             . {-# SCC "splitWith" #-} B.splitWith isNogud
             . {-# SCC "toLower" #-} B.map toLower

gives me:

  splitWords             Main     216           1   0.0    0.0    78.6   91.8
   filter                Main     217           1   1.9    3.0    78.6   91.8
    splitWith            Main     218           1  28.4   36.8    76.7   88.8
     isNogud             Main     221     6488666   4.2    4.1     4.2    4.1
     toLower             Main     219           1  44.2   47.9    44.2   47.9

i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea.

Am I reading this wrong?
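
(For context, splitWords without the SCCs is just the pipeline below. The isNogud definition here is my guess at its intent; the real one is in spelling_df. Note that both B.map toLower and B.splitWith walk the entire ~6 MB corpus.)

  import qualified Data.ByteString.Char8 as B
  import Data.Char (toLower)

  -- lower-case the whole corpus, split on anything that isn't a
  -- lower-case letter, and drop the empty fragments between separators
  splitWords :: B.ByteString -> [B.ByteString]
  splitWords = filter (not . B.null) . B.splitWith isNogud . B.map toLower
    where isNogud c = c < 'a' || c > 'z'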

> But once you start needing two edits (try korrekt), correct and edits1 
> start to show up. That shows that Norvig's algorithm isn't really good.
> With two edit steps, you create a _lot_ of strings you need to look up, far 
> more than there are in the map. That takes time. It'll be better to scan 
> the map for entries with an edit distance (< 3) if you have a good method 
> to check that 
> (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains 
> pointers for that).

Will have a look at that: it looks like it'll be very informative.
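
(For anyone following along: the candidate generation follows Norvig's edits1, roughly as sketched below on plain Strings; my actual version is at [3]. A word of length n yields about 54n + 25 single-edit candidates (n deletes, n-1 transposes, 26n replaces, 26(n+1) inserts), and applying edits1 again to each of those is the blow-up being described.)

  -- every string exactly one edit away from the word
  edits1 :: String -> [String]
  edits1 word = deletes ++ transposes ++ replaces ++ inserts
    where
      splits     = [ splitAt i word | i <- [0 .. length word] ]
      deletes    = [ a ++ bs      | (a, _:bs)   <- splits ]
      transposes = [ a ++ c:b:bs  | (a, b:c:bs) <- splits ]
      replaces   = [ a ++ c:bs    | (a, _:bs)   <- splits, c <- alphabet ]
      inserts    = [ a ++ c:b     | (a, b)      <- splits, c <- alphabet ]
      alphabet   = ['a' .. 'z']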

> Another thing is
> 
> allWords = keysSet wordCounts
> 
> Ouch. For each correction, you construct that set anew. Just use member 
> from Data.Map instead of Data.Set.member and look up the words in the map.

Whoops! Just to be clear, though: Haskell will memoise the result of "allWords" for a given invocation of "correct"? So this would only make a large difference for multiple corrections (which I wasn't worrying about for the moment)? The change seems to shave off about 0.2s on average.
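
(Concretely, the change is just the lookup below; isKnown is a made-up name standing in for however my code actually does the membership check:)

  -- before: allWords = keysSet wordCounts was rebuilt for every correction
  --   isKnown w = w `Set.member` allWords
  -- after: ask the Map directly, so no Set is built at all
  isKnown w = w `Map.member` wordCounts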

>> I will look at using Bloom Filters or
>> Trie’s instead of Data.Map, but I wonder if readFile should be taking
>> nearly 30% of the run time, even for a 6MB file?
> 
> No way. But it doesn't seem to, from my GHC's point of view.

Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:

  myReadFile            Main         210           1  35.8    8.6    35.8    8.6

i.e. 35.8% of runtime.
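
(The wrapper is nothing more than roughly this, so the cost centre should only be measuring the read itself:)

  -- split out purely so the profiler attributes the read to one cost centre
  myReadFile :: FilePath -> IO B.ByteString
  myReadFile path = {-# SCC "myReadFile" #-} B.readFile path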

>> Sorry to dump such a long post on the list — I’ll understand if no one
>> can be bothered rehashing this. But, in summary I’d like to know:
>> 
>>  (a) how could I use ByteStrings for this to speed up I/O and reduce
>> memory usage without losing the nice readability?

And thanks for the other small tips (e.g. findWithDefault); I didn't know you could use let the way you did, either.
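
(E.g. for the word-count lookup, roughly; the "before" form is a guess at what I had:)

  -- before (or something like it):
  --   wordCount w = fromMaybe 1 (Map.lookup w wordCounts)
  -- with the tip, the default is folded into the lookup:
  wordCount w = Map.findWithDefault 1 w wordCounts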

Cheers,

Matthew.



