[Haskell-cafe] Spelling checker exercise
Daniel Fischer
daniel.is.fischer at web.de
Mon Jan 25 02:19:59 EST 2010
On Monday, 25 January 2010, 05:34:50, Matthew Phillips wrote:
> On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
>
> <...>
>
> > I think I know what happened here:
> >
> > $ ghc -fforce-recomp --make matthew -o matthew0
>
> <...>
>
> > I habitually compile all code with -O2, unless I have a specific
> > reason not to. I tend to forget that some compile without
> > optimisations. For some kinds of code that makes hardly any
> > difference, for others it makes a big difference.
>
> I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not*
> -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would
> turn on optimisation, but of course that's just the flag for GCC.
By the way, compiling via C nowadays hardly ever produces faster code than
the native code generator (most times I have tried since 6.10, when there was
a difference at all, the native code was faster).
>
> $ time ./spelling becuase
> becuase -> because
>
> real 0m4.467s
> user 0m3.865s
> sys 0m0.280s
>
> $ time ./spelling_df becuase
> becuase -> because
>
> real 0m2.422s
> user 0m2.198s
> sys 0m0.081s
>
> Your previous version is close to twice as fast, and now only 2.5 times
> slower than Python.
>
> <snipped new version of code with toLower removed>
>
> With your suggested changes, your latest version on my machine:
>
> $ time ./spelling_df becuase
> becuase -> because
>
> real 0m1.731s
> user 0m1.572s
> sys 0m0.056s
>
> Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
>
> >>> But once you start needing two edits (try korrekt), correct and
> >>> edits1 start to show up. That shows that Norvig's algorithm isn't
> >>> really good. With two edit steps, you create a _lot_ of strings you
> >>> need to look up, far more than there are in the map. That takes
> >>> time. It'll be better to scan the map for entries with an edit
> >>> distance (< 3) if you have a good method to check that
> >
> > Indeed:
> > $ time ./nLDBSWSpelling becuase
> > becuase -> because
> > 2.84user 0.02system 0:02.86elapsed 100%CPU
> > $ time ./nLDBSWSpelling korrekt
> > korrekt -> correct
> > 2.83user 0.05system 0:02.88elapsed 100%CPU
>
> Not sure if I see what you're saying here: do you mean to point out the
> 2.86 vs 2.88 elapsed?
Well, above the code, I had the times for Norvig's algorithm (creating all
two-step edits and checking which are known):
> > And, without profiling:
> > $ time ./spellingBSW becuase
> > becuase -> because
> > 2.84user 0.03system 0:02.88elapsed 99%CPU
> >
> > Finally, building the set of two-step edits takes longer than the
> > additional lookups:
> >
> > $ time ./spellingBSW becuase
> > becuase -> because
> > 2.84user 0.03system 0:02.88elapsed 99%CPU
> > $ time ./spellingBSW korrekt
> > korrekt -> correct
> > 3.50user 0.02system 0:03.52elapsed 100%CPU
> >
> > vs.
> >
> > $ time ./spellingBSWL becuase
> > becuase -> because
> > 2.79user 0.04system 0:02.83elapsed 100%CPU
> > $ time ./spellingBSWL3 korrekt
> > korrekt -> correct
> > 3.20user 0.02system 0:03.23elapsed 99%CPU
For "becuase", which is a one-step edit, all take the same time (within
normal fluctuation), 2.8x seconds.
For the two-step edit "korrekt", the version building Sets takes 3.5
seconds, the version which doesn't build a Set of two-step edits takes 3.2
seconds, *and the version scanning the Map of known words for entries with
a Levenshtein distance [modified to account for transpositions] of less than
3 takes 2.8y seconds, practically the same time as for the one-step edit*.
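To make "scanning the Map" concrete, here is a rough sketch of the idea (not
the actual nLDBSWSpelling code, which works on ByteStrings, computes the
distance lazily and accounts for transpositions; wordCounts and the cutoff of
3 are taken from the discussion above):

    import qualified Data.Map as M
    import Data.List (foldl', maximumBy)
    import Data.Ord (comparing)

    -- Plain Levenshtein distance (no transpositions, no early cut-off);
    -- a real implementation would stop as soon as the bound is exceeded.
    levenshtein :: String -> String -> Int
    levenshtein a b = last (foldl' step [0 .. length b] a)
      where
        step row c = scanl calc (head row + 1) (zip3 b row (tail row))
          where
            calc left (cb, diag, up)
              | c == cb   = diag
              | otherwise = 1 + minimum [left, diag, up]

    -- Instead of generating all one- and two-step edits of the typo and
    -- looking each of them up, walk over the known words once and keep
    -- those within distance 2.
    correctScan :: M.Map String Int -> String -> String
    correctScan wordCounts word
      | word `M.member` wordCounts = word
      | null close                 = word
      | otherwise                  = fst (maximumBy (comparing snd) close)
      where
        close = [ (w, n) | (w, n) <- M.toList wordCounts
                         , levenshtein word w < 3 ]
        -- Norvig's algorithm prefers distance-1 corrections over distance-2
        -- ones; a faithful version would weight 'close' accordingly.

The point is that the work is bounded by the size of the map, not by the
number of generated edits, which explodes for two steps.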
It becomes more obvious, perhaps, if we test a couple of two-steppers:
Lazy Levenshtein:
time ./nLDBSWSpelling becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
2.94user 0.03system 0:02.97elapsed 100%CPU
Just something like 0.1-0.15 seconds more than for "becuase" alone, which
makes 0.02-0.04 seconds per two-edit correction.
Sets:
$ time ./spellingBSW becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
7.48user 0.03system 0:07.55elapsed 99%CPU
Good grief! That takes almost a second per two-edit correction.
List:
$ time ./spellingBSWL3 becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
5.54user 0.04system 0:05.58elapsed 99%CPU
Better, but still bad: roughly half a second per correction.
Python without psyco:
$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.46993017197 seconds
because
correct
holmes
territory
gutenberg
3.00user 0.08system 0:03.09elapsed 99%CPU
$ time python ./norvig.py becuase
Trained in 1.49027395248 seconds
because
1.46user 0.08system 0:01.55elapsed 100%CPU
about 0.3 seconds per correction
and with psyco:
$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.22417902946 seconds
because
correct
holmes
territory
gutenberg
2.12user 0.09system 0:02.21elapsed 100%CPU
$ time python ./norvig.py becuase
Trained in 1.17486715317 seconds
because
1.14user 0.08system 0:01.23elapsed 99%CPU
about 0.2 seconds per correction.
(Just for the record, I found out what Python did in the half second I
couldn't account for: it looked for matches for "./norvig.py". I forgot
that Python passes the name of the script in argv[0] - d'oh.)
>
> >>> Another thing is
> >>>
> >>> allWords = keysSet wordCounts
> >>>
> >>> Ouch. For each correction, you construct that set anew. Just use
> >>> member from Data.Map instead of Data.Set.member and look up the
> >>> words in the map.
> >>
> >> Whoops! Just to be clear though: Haskell will memoise the result of
> >> "allWords" for a given invocation of "correct"?
> >
> > Yes. But not across different invocations.
> >
> >> So this would only make a large difference for multiple corrections?
> >
> > Right. But that's the interesting use case, isn't it?
>
> It will be when I get the rest of it working, yes :)
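In code, the difference is just this (a sketch using the names from the
thread; wordCounts is the Map from words to counts):

    import qualified Data.Map as M
    import qualified Data.Set as S

    -- Before: every invocation rebuilds the whole Set of keys just to
    -- test membership (shared within one call, but not across calls).
    known :: M.Map String Int -> [String] -> [String]
    known wordCounts ws = filter (`S.member` allWords) ws
      where
        allWords = M.keysSet wordCounts

    -- After: ask the Map directly, no intermediate Set.
    known' :: M.Map String Int -> [String] -> [String]
    known' wordCounts ws = filter (`M.member` wordCounts) ws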
>
> >>>> I will look at using Bloom Filters or
> >>>> Tries instead of Data.Map, but I wonder if readFile should be
> >>>> taking nearly 30% of the run time, even for a 6MB file?
> >>>
> >>> No way. But it doesn't seem to, from my GHC's point of view.
> >>
> >> Just to be sure I wasn't using the SCC incorrectly, I split out the
> >> readFile call into "myReadFile". The prof line comes out as:
> >>
> >> myReadFile  Main  210  1  35.8  8.6  35.8  8.6
> >>
> >> i.e. 35.8% of runtime.
> >
> > Can I see the exact code which gives that profile?
> > Somehow, things which shouldn't must be attributed to readFile.
>
> The version at this link has myReadFile split out.
>
> http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228fd43b9/spelling.hs
>
> Doing the same to your version has the same result:
>
> myReadFile  Main  210  1  45.7  9.6  45.7  9.6
>
> It does seem that the profiler is wrong or misleading somehow.
Strange. Doing that here has no such effect. The only line in the profile
(ByteString) where myReadFile occurs is
myReadFile Main 260 1 0.0 0.9 0.0 0.9 0 6507348
(whitespace snipped): entered once, 6.5 million bytes allocated, no
measurable time. That is the case both compiling via C and with the native
code generator.
With String IO, it costs 4.6% time and 8.1% allocation.
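For reference, the split looks roughly like this (a sketch; "big.txt" stands
in for whatever corpus file is read):

    import qualified Data.ByteString.Char8 as B

    -- A separate top-level binding gets its own cost centre under
    -- -prof -auto-all, so the cost of reading the file shows up as its
    -- own line in the .prof output.  The String-IO variant would simply
    -- use Prelude.readFile instead.
    myReadFile :: FilePath -> IO B.ByteString
    myReadFile = B.readFile

    main :: IO ()
    main = do
        contents <- myReadFile "big.txt"
        print (B.length contents)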
>
> Two other quick questions:
>
> (1) you added parentheses to "candidates":
>
> candidates = known [word] `or` ((known e1) `or` known_edits2)
>
> The "or"'s should short circuit so that if "known [word]" is non-empty,
> the others shouldn't be evaluated. I read the above as forcing
> evaluation of the second "or" first: am I wrong?
Au contraire,
"Any operator lacking a fixity declaration is assumed to be infixl 9"
says the report, so without parentheses, it's parsed as
(known [word] `or` known e1) `or` known_edits2
and both 'or's must be evaluated even if word is known. Since 'or' is lazy
in its second argument, if we associate it as
known [word] `or` (known e1 `or` known_edits2)
then, in case of a known word, we needn't look at the parenthesised part at
all (it will probably not make a measurable difference in running time unless
you check a couple of million known words, but still, it feels lazier). For a
local definition, I thought an explicit fixity declaration was overkill.
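Spelled out (a sketch of an 'or' like the one in the thread, written here as
a top-level definition so a fixity declaration can be shown too; the real one
works on the candidate collections rather than plain lists):

    import Prelude hiding (or)

    -- First non-empty argument wins; lazy in the second argument.
    or :: [a] -> [a] -> [a]
    or a b
      | null a    = b
      | otherwise = a

    -- The default fixity is infixl 9, so
    --   known [word] `or` known e1 `or` known_edits2
    -- parses as (known [word] `or` known e1) `or` known_edits2: for a known
    -- word both 'or' applications are reduced (though laziness still keeps
    -- known e1 and known_edits2 from being computed).  With the parentheses,
    -- or with a declaration like the one below, a known word never touches
    -- the rest of the expression.
    infixr 2 `or`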
>
> (2) you eliminated the "fold" in "correct" in favour of a tail-recursive
> search in "maxCount": was this for style or performance reasons (or both
> :)?
Performance, kind of. Since the lists we fold over are in fact short, it
doesn't really matter, but if they were long, there'd be the risk of
unnecessarily allocating lots of pairs. I'm rather confident that with -O2,
GHC will eliminate the pairs anyway, but without looking at the core, I'm
not entirely sure.
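As an illustration (hypothetical signatures; the real maxCount works on
whatever candidate structure is in play):

    import qualified Data.Map as M
    import Data.List (foldl')

    -- Fold version: the accumulator is a pair, so every step that finds a
    -- better candidate allocates a fresh (word, count) pair, unless GHC
    -- unboxes it.
    bestFold :: M.Map String Int -> [String] -> String
    bestFold wm ws = fst (foldl' pick ("", 0) ws)
      where
        pick best@(_, bn) w =
          let n = M.findWithDefault 0 w wm
          in if n > bn then (w, n) else best

    -- Tail-recursive version: the current best word and its count are plain
    -- arguments, nothing is boxed up just to be taken apart again.
    bestRec :: M.Map String Int -> [String] -> String
    bestRec wm = go "" 0
      where
        go bw _  []     = bw
        go bw bn (w:ws)
          | n > bn    = go w n ws
          | otherwise = go bw bn ws
          where
            n = M.findWithDefault 0 w wm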
However, it was really an unthought-through rewrite: I had just been helping
someone who got a stack overflow in spite of foldl', due to the laziness of
the data constructors [*]. So I made sure that couldn't happen here, without
looking at the folded function to see whether it already prevents it. And in
fact it does, so if you use lists instead of Sets, a foldl' is fine; with
respect to style, it is even preferable.
[*] Classic example: why will

    average xs = sum / len
      where
        (sum, len) = foldl' accum (0, 0) xs
        accum (sm, ln) x = (sm + x, ln + 1)

cause a stack overflow for long lists?
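(A sketch of the standard fix, for completeness: foldl' only forces the
accumulator to weak head normal form, i.e. to the outermost pair constructor,
so the sums inside keep growing as thunks; forcing the components as well
avoids that.)

    import Data.List (foldl')

    average :: [Double] -> Double
    average xs = sm / fromIntegral ln
      where
        (sm, ln) = foldl' accum (0, 0 :: Int) xs
        -- Evaluate both components before building the new pair, e.g. with
        -- seq (bang patterns would do the same job).
        accum (s, l) x = let s' = s + x
                             l' = l + 1
                         in s' `seq` l' `seq` (s', l')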
>
> Cheers,
>
> Matthew.