[Haskell-cafe] Optimizing spelling correction program
wren ng thornton
wren at freegeek.org
Wed Jun 24 19:57:13 EDT 2009
Kamil Dworakowski wrote:
> The timing per expression however showed something useful: 70% of the
> time was being spent determining membership in the known words
> dictionary (Data.Map). There are about 30 000 items in the dictionary.
>
> Right... Python uses hashtables while here I have a tree with log n
> access time. I did not want to use Data.HashTable, as it would pervade
> my program with IO. The alternative is an ideal hashmap that never
> gets changed. This program creates a dictionary at start which is then
> only read from: an ideal application for Data.PerfectHash by Mark
> Wotton, available on Hackage [3].
>
> PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate
> the program to use these instead of Strings. Sadly, that increased the
> running time to 59s. Adding PerfectHash brought the running time down
> to 39s. I have done some code restructuring with a view to publishing
> it, which brought the running time up to 45s. The code is available on
> patch-tag [4] for anyone to scrutinize and hopefully point out
> mistakes (it uses the big.txt and misspellings files linked from
> Norvig's page [1]).
You should also consider using Data.Trie from the bytestring-trie
package. It also uses ByteStrings as keys, but it is a big-endian
Patricia trie rather than a hash, which removes Data.Map's duplicated
effort of re-comparing the common prefixes of strings at every step
down the tree. Constructing a Data.Trie can be somewhat slower than
constructing a Data.Map ByteString, but lookup is much faster.
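For concreteness, here's a minimal sketch of the known-words table as a
Data.Trie; the file name and the naive tokenization are mine, not
Kamil's:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Trie as T

    -- Build the known-words table once, at startup. Tokenization is
    -- deliberately naive; substitute whatever the real program does.
    buildKnown :: FilePath -> IO (T.Trie ())
    buildKnown path = do
        txt <- B.readFile path
        return (T.fromList [ (w, ()) | w <- B.words txt ])

    -- Membership test: the hot path the profile pointed at.
    known :: T.Trie () -> B.ByteString -> Bool
    known = flip T.member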
If your misspellings file changes infrequently and you're not afraid of
utilities, you can pre-compile the Data.Trie and read it in with
Data.Binary. If it changes really infrequently and you're not afraid of
recompilation, you can use Template Haskell to store the pre-compiled
form directly in the executable.
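Here's a sketch of the Data.Binary route; I'm assuming bytestring-trie
doesn't ship a Binary instance for Trie, so we round-trip through the
association list, which does have one:

    import Data.Binary (encodeFile, decodeFile)
    import qualified Data.Trie as T

    -- ByteString, lists, tuples, and Int all have Binary instances,
    -- so serialize the trie as its association list.
    saveTrie :: FilePath -> T.Trie Int -> IO ()
    saveTrie path = encodeFile path . T.toList

    loadTrie :: FilePath -> IO (T.Trie Int)
    loadTrie path = fmap T.fromList (decodeFile path)

Rebuilding with fromList costs a little at load time, but it keeps the
on-disk format independent of the trie's internal representation.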
> Frankly, I am not impressed with the PerfectHash performance; it looks
> as if it is only about twice as fast as Data.Map. Is this because of
> the number of elements?
Perfection does have its cost.
I haven't played with Data.PerfectHash, but if the hashing function is
slow then it will kill your performance. In general, hashing does too
much work because it inspects the whole ByteString, which is usually
unnecessary (though that extra up-front work can still cost less than
deferring the comparisons, depending on the dataset).
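If you want numbers rather than intuition, a criterion micro-benchmark
along these lines would show where the lookup time goes (using big.txt
as the word source is just my assumption, carried over from Norvig's
setup):

    import Criterion.Main
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Map as M
    import qualified Data.Trie as T

    main :: IO ()
    main = do
        ws <- fmap B.words (B.readFile "big.txt")
        let m     = M.fromList [ (w, ()) | w <- ws ]
            t     = T.fromList [ (w, ()) | w <- ws ]
            probe = B.pack "spelling"
        defaultMain
            [ bench "Data.Map membership"  (whnf (`M.member` m) probe)
            , bench "Data.Trie membership" (whnf (`T.member` t) probe)
            ]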
...
Though honestly, most spellcheckers are more interested in optimizing
for space than for time. For that, you'll want to look into something
like Golomb codes (that is, Huffman codes for exponentially distributed
infinite alphabets), as used in the Unix "spell" utility back in the
day.
Once you've defined a prefix-free code for all possible words, you can
encode each word, pack the result into a ByteString (treated as an
array of bits), and hand it off to Data.Trie, which will do the right
thing. Depending on the size of your data, you may want to squash the
Data.Trie down into an array and use an algorithm rather than the data
structure to do your search, just as binary search algorithms on arrays
actualize data structures like Data.Map (or, if you prefer, data
structures like Data.Map reify binary search algorithms).
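As a sketch of the bit-packing step, here's a hypothetical helper that
packs bits MSB-first into a strict ByteString, zero-padding the final
byte (harmless here: with a prefix-free code, two distinct codewords
could only collide under padding if one were a prefix of the other):

    import Data.Bits ((.|.), shiftL)
    import Data.Word (Word8)
    import qualified Data.ByteString as BS

    -- Pack bits MSB-first into bytes, zero-padding the final byte.
    packBits :: [Bool] -> BS.ByteString
    packBits = BS.pack . map toByte . chunk8
      where
        chunk8 [] = []
        chunk8 bs = let (h, t) = splitAt 8 bs
                    in (h ++ replicate (8 - length h) False) : chunk8 t
        toByte    = foldl step 0
        step w b  = (w `shiftL` 1) .|. (if b then 1 else 0 :: Word8)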
--
Live well,
~wren