[Haskell-cafe] Spelling checker exercise

Tue Jan 26 10:13:44 EST 2010

Am Dienstag 26 Januar 2010 14:11:06 schrieb Eduard Sergeev:
> Matthew Phillips-5 wrote:
> > I also found it to to be very slow.
>
> My variant:  http://a-ejeli-tak.livejournal.com/1326.html Spellchecker
> in Haskell
> String version runs in 2.5 sec, ByteString in 1.2 sec (just for one word
> e.g. just to build the tree). 8 sec to check input of 400 words (copied
> from Norvig's example).

Slower here, time for building the set is approximately equal to that of of 
the frequency map [either String or ByteString], lookup a little slower if 
one edit is needed, much faster if two are needed (of course).
But the lazy Levenshtein distance is much faster again, for the 'tests2' 
data from Norvig's http://norvig.com/spell.py (400 words), 

$ xargs -a tdata.txt time ./nLDBSWSpelling > /dev/null
4.50user 0.03system 0:04.53elapsed 100%CPU
$ time ./esergSpellBS big.txt tdata.txt > /dev/null
28.23user 0.09system 0:28.32elapsed 100%CPU

surprisingly (?), your plain String version is faster for that than your 
ByteString version:

$ time ./esergSpellS big.txt tdata.txt > /dev/null
25.07user 0.10system 0:25.18elapsed 99%CPU

> I think laziness helps here to avoid unnecessary
> checks (once the first match is found).

But that's the point, these checks aren't unnecessary (unless the word 
under inspection is known). You want to propose the most likely correct 
word.

If your input is "arthetic", should you return "aesthetic", just because 
it's the first of the (at least four) correct words with edit distance 2[*] 
which is produced by your arbitrary ordering of edit steps or it's the 
lexicographically smallest?

I think you shouldn't.

> Haven't tried it on a larger data sets neither tried to optimize it.
> Cheated on dictionary parsing though...

[*] The others I know are "bathetic", "pathetic" and "arthritic" - without 
context, I'd go for "arthritic" because I think spelling errors are more 
common i the middle of a word than at the beginning or at the end, but 
plain frequency analysis of the corpus suggests "pathetic".