[Haskell-cafe] Spelling checker exercise
Daniel Fischer
daniel.is.fischer at web.de
Tue Jan 26 10:13:44 EST 2010
Am Dienstag 26 Januar 2010 14:11:06 schrieb Eduard Sergeev:
> Matthew Phillips-5 wrote:
> > I also found it to to be very slow.
>
> My variant: http://a-ejeli-tak.livejournal.com/1326.html Spellchecker
> in Haskell
> String version runs in 2.5 sec, ByteString in 1.2 sec (just for one word
> e.g. just to build the tree). 8 sec to check input of 400 words (copied
> from Norvig's example).
Slower here, time for building the set is approximately equal to that of of
the frequency map [either String or ByteString], lookup a little slower if
one edit is needed, much faster if two are needed (of course).
But the lazy Levenshtein distance is much faster again, for the 'tests2'
data from Norvig's http://norvig.com/spell.py (400 words),
$ xargs -a tdata.txt time ./nLDBSWSpelling > /dev/null
4.50user 0.03system 0:04.53elapsed 100%CPU
$ time ./esergSpellBS big.txt tdata.txt > /dev/null
28.23user 0.09system 0:28.32elapsed 100%CPU
surprisingly (?), your plain String version is faster for that than your
ByteString version:
$ time ./esergSpellS big.txt tdata.txt > /dev/null
25.07user 0.10system 0:25.18elapsed 99%CPU
> I think laziness helps here to avoid unnecessary
> checks (once the first match is found).
But that's the point, these checks aren't unnecessary (unless the word
under inspection is known). You want to propose the most likely correct
word.
If your input is "arthetic", should you return "aesthetic", just because
it's the first of the (at least four) correct words with edit distance 2[*]
which is produced by your arbitrary ordering of edit steps or it's the
lexicographically smallest?
I think you shouldn't.
> Haven't tried it on a larger data sets neither tried to optimize it.
> Cheated on dictionary parsing though...
[*] The others I know are "bathetic", "pathetic" and "arthritic" - without
context, I'd go for "arthritic" because I think spelling errors are more
common i the middle of a word than at the beginning or at the end, but
plain frequency analysis of the corpus suggests "pathetic".
More information about the Haskell-Cafe
mailing list