[Haskell-cafe] Spelling checker exercise
Matthew Phillips
mattphil at gmail.com
Sun Jan 24 23:34:50 EST 2010
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
<...>
> I think I know what happened here:
>
> $ ghc -fforce-recomp --make matthew -o matthew0
<...>
> I habitually compile all code with -O2, unless I have a specific reason not
> to. I tend to forget that some people compile without optimisations.
> For some kinds of code that makes hardly any difference, for others it
> makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that flag just passes -O2 through to GCC.
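(For the record, I gather the line I should have used is something more like

  $ ghc -O2 -funbox-strict-fields --make spelling -o spelling

i.e. with GHC's own optimiser actually switched on.)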
$ time ./spelling becuase
becuase -> because
real 0m4.467s
user 0m3.865s
sys 0m0.280s
$ time ./spelling_df becuase
becuase -> because
real 0m2.422s
user 0m2.198s
sys 0m0.081s
Your previous version is close to twice as fast, and now only 2.5 times slower than Python.
<snipped new version of code with toLower removed>
With your suggested changes, your latest version on my machine:
$ time ./spelling_df becuase
becuase -> because
real 0m1.731s
user 0m1.572s
sys 0m0.056s
Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
>>> But once you start needing two edits (try korrekt), correct and edits1
>>> start to show up. That shows that Norvig's algorithm isn't really
>>> good. With two edit steps, you create a _lot_ of strings you need to
>>> look up, far more than there are in the map. That takes time. It'll be
>>> better to scan the map for entries with an edit distance (< 3) if you
>>> have a good method to check that.
>
> Indeed:
> $ time ./nLDBSWSpelling becuase
> becuase -> because
> 2.84user 0.02system 0:02.86elapsed 100%CPU
> $ time ./nLDBSWSpelling korrekt
> korrekt -> correct
> 2.83user 0.05system 0:02.88elapsed 100%CPU
Not sure if I see what you're saying here: do you mean to point out the 2.86 vs 2.88 elapsed?
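If I've understood the map-scanning suggestion, it would be something roughly like this (just a sketch with my own names, using a plain dynamic-programming Levenshtein distance and filtering the whole dictionary against the bound):

import qualified Data.Map as Map

-- Standard one-row-at-a-time Levenshtein distance.
editDistance :: String -> String -> Int
editDistance a b = last (foldl step [0 .. length a] b)
  where
    step row@(r:rs) c = scanl next (r + 1) (zip3 a row rs)
      where
        next left (ac, diag, up) =
          minimum [up + 1, left + 1, diag + fromEnum (ac /= c)]
    step [] _ = []  -- never reached: the row is never empty

-- Keep only the dictionary entries within the bound of the misspelt word.
withinDistance :: Int -> String -> Map.Map String Int -> Map.Map String Int
withinDistance bound w = Map.filterWithKey (\cand _ -> editDistance w cand < bound)

so that, assuming "correct" is in the dictionary, Map.keys (withinDistance 3 "korrekt" wordCounts) would include it. Whether scanning every key like that beats generating the edits is presumably the interesting part.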
>>> Another thing is
>>>
>>> allWords = keysSet wordCounts
>>>
>>> Ouch. For each correction, you construct that set anew. Just use
>>> member from Data.Map instead of Data.Set.member and look up the words
>>> in the map.
>>
>> Whoops! Just to be clear though: Haskell will memoise the result of
>> "allWords" for a given invocation of "correct"?
>
> Yes. But not across different invocations.
>
>> So this would only make a large difference for multiple corrections?
>
> Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
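So the fix, if I follow, is essentially just this (a sketch: the names and types are guesses at my own code rather than copied from it):

import qualified Data.Map as Map
import qualified Data.Set as Set

known :: Map.Map String Int -> [String] -> Set.Set String
known wordCounts ws = Set.fromList [w | w <- ws, w `Map.member` wordCounts]

i.e. test membership directly in the Map and never rebuild the key set.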
>>>> I will look at using Bloom filters or tries instead of Data.Map, but I
>>>> wonder if readFile should be taking nearly 30% of the run time, even for
>>>> a 6MB file?
>>>
>>> No way. But it doesn't seem to, from my GHC's point of view.
>>
>> Just to be sure I wasn't using the SCC incorrectly, I split out the
>> readFile call into "myReadFile". The prof line comes out as:
>>
>> myReadFile Main 210 1 35.8 8.6 35.8 8.6
>>
>> i.e. 35.8% of runtime.
>>
>
> Can I see the exact code which gives that profile?
> Somehow, things which shouldn't must be attributed to readFile.
The version at this link has myReadFile split out.
http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228fd43b9/spelling.hs
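The split itself is nothing more than a top-level wrapper, roughly:

myReadFile :: FilePath -> IO String
myReadFile = readFile

built with profiling on (-prof -auto-all, or an explicit SCC annotation) so that the binding gets its own cost centre.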
Doing the same to your version has the same result:
myReadFile Main 210 1 45.7 9.6 45.7 9.6
It does seem that the profiler is wrong or misleading somehow.
Two other quick questions:
(1) you added parentheses to "candidates":
candidates = known [word] `or` ((known e1) `or` known_edits2)
The "or"'s should short circuit so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?
(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?
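(For reference, the two shapes I mean are roughly these — my own reconstruction, not the exact code from either of our versions:

import qualified Data.Map as Map
import Data.List (foldl')

-- fold-based: pick the candidate with the highest count
maxCountFold :: Map.Map String Int -> [String] -> String
maxCountFold wc (w:ws) = foldl' pick w ws
  where
    count x     = Map.findWithDefault 1 x wc
    pick best x = if count x > count best then x else best
maxCountFold _ [] = error "no candidates"

-- explicit tail recursion, carrying the best count along
maxCountRec :: Map.Map String Int -> [String] -> String
maxCountRec wc (w:ws) = go w (Map.findWithDefault 1 w wc) ws
  where
    go best _  []     = best
    go best bc (x:xs) =
      let c = Map.findWithDefault 1 x wc
      in if c > bc then go x c xs else go best bc xs
maxCountRec _ [] = error "no candidates"

At least in this sketch the recursive version avoids re-looking-up the current best's count at every step, so I can see it might be slightly faster as well as a matter of taste.)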
Cheers,
Matthew.