[Haskell-cafe] Optimizing spelling correction program
wren ng thornton
wren at freegeek.org
Wed Jun 24 19:57:13 EDT 2009
Kamil Dworakowski wrote:
> The timing per expression however showed something useful: 70% of the
> time was being spent determining membership in the known words
> dictionary (Data.Map). There are about 30 000 items in the dictionary.
>
> Right... Python uses hashtables while here I have a tree with log n
> access time. I did not want to use Data.HashTable, as it would pervade
> my program with IO. The alternative is an ideal hashmap that never
> gets changed. This program creates a dictionary at start which is then
> only read from: an ideal application for Data.PerfectHash by Mark
> Wotton, available on Hackage [3].
>
> PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate
> the program to use these instead of Strings. Sadly, that increased the
> running time to 59s. Adding PerfectHash brought the running time down
> to 39s. I have done some code restructuring with a view to publishing
> it, which brought the running time up to 45s. The code is available on
> patch-tag [4] for anyone to scrutinize and hopefully point out
> mistakes (it uses the big.txt and misspellings files linked from
> Norvig's page [1]).
You should also consider using Data.Trie from the bytestring-trie
package. It also uses ByteStrings as keys, but it is a big-endian
Patricia trie rather than a hash, which removes Data.Map's duplicated
effort of re-comparing the common prefixes of strings at every step
down the tree. Constructing a Data.Trie can be somewhat slower than
constructing a Data.Map ByteString, but lookup is much faster.
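For concreteness, here's a minimal sketch of the known-words table as a
Data.Trie; the file name and the naive tokenization are mine, not
Kamil's:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Trie as T

    -- Build the known-words table once, at startup. Tokenization is
    -- deliberately naive; substitute whatever the real program does.
    buildKnown :: FilePath -> IO (T.Trie ())
    buildKnown path = do
        txt <- B.readFile path
        return (T.fromList [ (w, ()) | w <- B.words txt ])

    -- Membership test: the hot path the profile pointed at.
    known :: T.Trie () -> B.ByteString -> Bool
    known = flip T.member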
If your misspellings file changes infrequently and you're not afraid of
utilities, you can pre-compile the Data.Trie and read it in with
Data.Binary. If it changes really infrequently and you're not afraid of
recompilation, you can use Template Haskell to store the pre-compiled
form directly in the executable.
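Here's a sketch of the Data.Binary route; I'm assuming bytestring-trie
doesn't ship a Binary instance for Trie, so we round-trip through the
association list, which does have one:

    import Data.Binary (encodeFile, decodeFile)
    import qualified Data.Trie as T

    -- ByteString, lists, tuples, and Int all have Binary instances,
    -- so serialize the trie as its association list.
    saveTrie :: FilePath -> T.Trie Int -> IO ()
    saveTrie path = encodeFile path . T.toList

    loadTrie :: FilePath -> IO (T.Trie Int)
    loadTrie path = fmap T.fromList (decodeFile path)

Rebuilding with fromList costs a little at load time, but it keeps the
on-disk format independent of the trie's internal representation.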
> Frankly, I am not impressed with the PerfectHash performance; it looks
> as if it is only about twice as fast as Data.Map. Is this because of
> the number of elements?
Perfection does have its cost.
I haven't played with Data.PerfectHash, but if the hashing function is
slow then it will kill your performance. In general, hashing does too
much work because it inspects the whole ByteString, which is usually
unnecessary (though that extra up-front work can still cost less than
deferring the comparisons, depending on the dataset).
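If you want numbers rather than intuition, a criterion micro-benchmark
along these lines would show where the lookup time goes (using big.txt
as the word source is just my assumption, carried over from Norvig's
setup):

    import Criterion.Main
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Map as M
    import qualified Data.Trie as T

    main :: IO ()
    main = do
        ws <- fmap B.words (B.readFile "big.txt")
        let m     = M.fromList [ (w, ()) | w <- ws ]
            t     = T.fromList [ (w, ()) | w <- ws ]
            probe = B.pack "spelling"
        defaultMain
            [ bench "Data.Map membership"  (whnf (`M.member` m) probe)
            , bench "Data.Trie membership" (whnf (`T.member` t) probe)
            ]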
...
Though honestly, most spellcheckers are more interested in optimizing
for space than for time. For that, you'll want to look into something
like Golomb codes (that is, Huffman codes for exponentially distributed
infinite alphabets), as used in the Unix "spell" utility back in the
day.
Once you've defined a prefix-free code for all possible words, you can
encode each word, pack the result into a ByteString (treated as an
array of bits), and hand it off to Data.Trie, which will do the right
thing. Depending on the size of your data, you may want to squash the
Data.Trie down into an array and use an algorithm rather than the data
structure to do your search, just as binary search algorithms on arrays
actualize data structures like Data.Map (or, if you prefer, data
structures like Data.Map reify binary search algorithms).
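As a sketch of the bit-packing step, here's a hypothetical helper that
packs bits MSB-first into a strict ByteString, zero-padding the final
byte (harmless here: with a prefix-free code, two distinct codewords
could only collide under padding if one were a prefix of the other):

    import Data.Bits ((.|.), shiftL)
    import Data.Word (Word8)
    import qualified Data.ByteString as BS

    -- Pack bits MSB-first into bytes, zero-padding the final byte.
    packBits :: [Bool] -> BS.ByteString
    packBits = BS.pack . map toByte . chunk8
      where
        chunk8 [] = []
        chunk8 bs = let (h, t) = splitAt 8 bs
                    in (h ++ replicate (8 - length h) False) : chunk8 t
        toByte    = foldl step 0
        step w b  = (w `shiftL` 1) .|. (if b then 1 else 0 :: Word8)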
--
Live well,
~wren