[Haskell-cafe] Optimizing spelling correction program

Mon Jun 22 05:03:43 EDT 2009

Hey, you're using String I/O!

nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")

This should be

WORDS <- fmap (train . B.words) (B.readFile "big.txt")

By the way, which exact file do you use as a misspellings file? The
corpus linked to at Norvig's page has many.
And do you have a driver program that I could run and obtain your timings?

2009/6/22 Kamil Dworakowski <kamil at dworakowski.name>:
> Hi all,
>
> I want to learn how to optimize Haskell code on the Norvig's spelling
> correction program [1]. Peter Norvig has written the program in
> Python,
> quite a literate translation to Haskell (thanks to Grzegorz Chrupala)
> [2] was also available.
>
> Both versions are about 23 LOCs, and Haskell version is four times
> slower, 2m25s versus 36s. All the numbers I give come from running the
> program on a list of 806 misspellings, Haskell version compiled with -
> O2
> and -fforce-recomp flags.
>
> I started with trial and error approach. Just preventing a repetitive
> call to keysSet brought the running time down to 1m48s. I also
> restructured the edits1 function and dropped all uses of sets, which
> haven't been pulling their own weight. This brought the running time
> down to 55s. Not quite there yet but close.
>
> Reading the Profiling chapter in the RWH book enabled me to do some
> more
> informed optimizing. At this point the GC time was about 2% of the
> overall runnig time, Grzegorz has done a good job of using strict
> versions of functions where necessary. The timing per expression
> however
> showed something useful, 70% of the time was being spent around
> determining membership in the known words dictionary (Data.Map). There
> are about 30 000 items in the dictionary.
>
> Right... Python uses hashtables while here I have a tree with log n
> access time. I did not want to use the Data.HashTable, it would
> pervade
> my program with IO. The alternative is an ideal hashmap that never
> gets
> changed. This program creates a dictionary at start which then is only
> being used to read from: an ideal application for the Data.PerfectHash
> by Mark Wotton available on Hackage [3].
>
> The PerfectHash is a dictionary keyed by ByteStrings, so I had to
> migrate the program to use these instead of Strings. Sadly it
> increased
> the running time to 59s. Adding PerfectHash brought the running time
> down to 39s. I have done some code restructuring with the view to
> publishing it, which brought the running time up to 45s. The code is
> available on patch-tag [4] for anyone to scrutinize and hopefully
> point
> out mistakes (it uses big.txt and misspellings file which are linked
> to
> from Norvig's page [1]).
>
> At this point [5] I don't know how to proceed. The program is still
> slower than Python. The -ddump-simpl dump does not give me any clues,
> there is too much of it and I don't understand most of it. The GC time
> is about 10% of the total, and according to the profiler almost half
> the
> time is spent in the member function, and almost all the rest in
> edits1.
> Shouldn't it run much faster than the interpreted Python?
>
> Frankly, I am not impressed with the PerfectHash performance; It looks
> as if it is only two times faster than Data.Map. Is this because of
> the
> number of elements?
>
> 1. http://www.norvig.com/spell-correct.html
> 2. http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.html
> 3. http://hackage.haskell.org/package/PerfectHash
> 4. http://patch-tag.com/r/spellcorrect/
> 5.
> http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48a7ccf07b79c572fddb0304648fe80d39904/content/pretty/SpellingCorrection.hs
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

-- 
Eugene Kirpichov
Web IR developer, market.yandex.ru