[Haskell-cafe] Spelling checker exercise
Matthew Phillips
mattphil at gmail.com
Sun Jan 24 00:19:58 EST 2010
Thanks very much, Daniel, for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.
On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
> On Friday 22 January 2010 07:51:27, Matthew Phillips wrote:
>> Hello all,
>>
>> sorry to bring up an old chestnut, but I’m trying to improve my
>> Haskell-fu by writing a small program, and chose Peter Norvig’s spelling
>> checker as my exercise material (http://norvig.com/spell-correct.html).
>>
>> While I’ve gotten it working, and found it quite illuminating, I also
>> found it to be very slow. And then I discovered that, of course,
>> others have been here well before me ([1] and [2]). Those discussions
>> were very interesting, but the code they refer to is mostly not
>> available, so the most I’ve gotten out of it so far is that:
>>
>> (a) I should be using strict left folds and strict Map insertions for
>> the word frequency map (this shaved off about a second: ~5s -> ~4s for
>> a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
>> (b) I should probably be using ByteStrings
>
> That does help, but the worst part is building the map. That takes a couple
> of seconds in Python, too. Just building the map takes 1.95s for Python,
> 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and
> 6s with plain Strings here.
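(For reference, the map-building code after changes (a) and (b) looks roughly like the following. This is a minimal sketch of the approach, not my exact source, which is linked at [3] below.)

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map
import Data.List (foldl')

-- Build the word-frequency map with a strict left fold and strict
-- insertion, so no thunks pile up behind the counts.
train :: [B.ByteString] -> Map.Map B.ByteString Int
train = foldl' updateMap Map.empty
  where updateMap model word = Map.insertWith' (+) word 1 model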
I get the Python version running in about 1s, compared to 5.6s for the Haskell:
$ time python spelling.py
because
real 0m1.071s
user 0m0.821s
sys 0m0.139s
$ time ./spelling becuase
becuase -> because
real 0m5.589s
user 0m4.554s
sys 0m0.307s
And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:
$ time ./spelling_df becuase
becuase -> because
real 0m8.087s
user 0m6.966s
sys 0m0.193s
$ time ./spelling korrekt
korrekt -> correct
real 0m5.970s
user 0m4.885s
sys 0m0.300s
$ time ./spelling_df korrekt
korrekt -> correct
real 0m8.616s
user 0m7.538s
sys 0m0.187s
>> The code is at [3] (link to version at time of post). Profiling [4]
>> shows:
>>
>> $ ./spelling becuase +RTS -p
>> becuase -> because
>> $ cat spelling.prof
>> total time = 4.02 secs (201 ticks @ 20 ms)
>> total alloc = 1,544,257,792 bytes (excludes profiling overheads)
>>
>> COST CENTRE MODULE %time %alloc
>>
>> train Main 52.7 19.7
>> readFile Main 28.9 8.6
>> wordsBy Main 10.9 49.5
>> toLower Main 7.0 21.8
>> ...
>>
>> So it appears that "train" (building the freq map) and "readFile" in
>> "nwords" are the places to hone.
>
> readFile does not appear in my profile.
Apologies, I should have said that I'd inserted some SCCs to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
> If you insert an SCC for updateMap,
>
> where updateMap model word =
>         {-# SCC "updateMap" #-} insertWith' (+) word 1 model
>
> , you'll see that the really bad citizen is updateMap (splitWords is rather
> bad, too, together they take some 95% of the time in that profile).
Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of the runtime. Adding SCCs like this:

splitWords = {-# SCC "filter" #-} filter (not . B.null)
           . {-# SCC "splitWith" #-} B.splitWith isNogud
           . {-# SCC "toLower" #-} B.map toLower
gives me:
                                               individual    inherited
COST CENTRE   MODULE    no.    entries   %time %alloc    %time %alloc

splitWords    Main      216          1     0.0    0.0     78.6   91.8
filter        Main      217          1     1.9    3.0     78.6   91.8
splitWith     Main      218          1    28.4   36.8     76.7   88.8
isNogud       Main      221    6488666     4.2    4.1      4.2    4.1
toLower       Main      219          1    44.2   47.9     44.2   47.9
i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea.
Am I reading this wrong?
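One experiment I might try: since Norvig's corpus is plain ASCII, swap Data.Char.toLower (which handles full Unicode) for a hand-rolled ASCII-only version. A sketch ("asciiLower" is my name for it, not something in the posted code):

-- ASCII-only lowercasing: skips Data.Char.toLower's general Unicode
-- handling, which I suspect is where that 44% is going.
asciiLower :: Char -> Char
asciiLower c
  | c >= 'A' && c <= 'Z' = toEnum (fromEnum c + 32)
  | otherwise            = c

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . B.null) . B.splitWith isNogud . B.map asciiLower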
> But once you start needing two edits (try korrekt), correct and edits1
> start to show up. That shows that Norvig's algorithm isn't really good.
> With two edit steps, you create a _lot_ of strings you need to look up, far
> more than there are in the map. That takes time. It'll be better to scan
> the map for entries with an edit distance (< 3) if you have a good method
> to check that
> (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains
> pointers for that).
Will have a look at that: it looks like it'll be very informative.
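If I follow the idea, the scan needs a distance test that can give up early. Something like the following is what I have in mind: a sketch over plain Strings, and my own guess at the technique rather than code from that thread.

-- Row-wise Levenshtein distance, abandoned as soon as every cell in
-- the current row exceeds the bound. Row minima never decrease, so
-- no later row can come back under it.
withinEditDistance :: Int -> String -> String -> Bool
withinEditDistance k a b = go [0 .. length b] a
  where
    go row []        = last row <= k
    go row (ca:rest)
      | minimum row' > k = False
      | otherwise        = go row' rest
      where row' = step row ca
    step row@(r:rs) ca = scanl f (r + 1) (zip3 b row rs)
      where f cur (cb, diag, up) =
              minimum [cur + 1, up + 1, diag + if ca == cb then 0 else 1]

Correcting a word would then be a single pass over the map's keys, keeping entries that satisfy withinEditDistance 2 and picking the most frequent.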
> Another thing is
>
> allWords = keysSet wordCounts
>
> Ouch. For each correction, you construct that set anew. Just use member
> from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear, though: Haskell will memoise the result of "allWords" for a given invocation of "correct"? So this would only make a large difference for multiple corrections (which I wasn't worrying about for the moment)? The change seems to shave off about 0.2s on average.
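For the archive, the change amounts to something like this (a sketch; my actual function differs slightly):

import qualified Data.Map as Map

-- Test membership against the frequency map directly, instead of
-- building a Set of all its keys with keysSet on every call.
known :: Map.Map B.ByteString Int -> [B.ByteString] -> [B.ByteString]
known wordCounts = filter (`Map.member` wordCounts)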
>> I will look at using Bloom filters or
>> tries instead of Data.Map, but I wonder if readFile should be taking
>> nearly 30% of the run time, even for a 6MB file?
>
> No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
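(That is, I gave the readFile call its own name so the SCC has something to attach to. Roughly, assuming the corpus file is Norvig's big.txt:)

myReadFile :: IO B.ByteString
myReadFile = {-# SCC "myReadFile" #-} B.readFile "big.txt"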
>> Sorry to dump such a long post on the list — I’ll understand if no one
>> can be bothered rehashing this. But, in summary I’d like to know:
>>
>> (a) how could I use ByteStrings for this to speed up I/O and reduce
>> memory usage without losing the nice readability?
Thanks also for the other small tips (e.g. findWithDefault); I didn't know you could use let the way you did, either.
Cheers,
Matthew.