[Haskell-cafe] Optimizing a title matcher
Lyle Kopnicky
lists at qseep.net
Tue Sep 26 16:44:05 EDT 2006
Hi folks,
I'm competing in a contest at work, and we're allowed to use whatever
language we want. I decided this was my chance to prove to people that
Haskell was up to the challenge. Unfortunately, I ran into performance
problems. Since the contest ends this Friday, I've decided to switch to
C++ (gasp!). But if any of you have advice on how to speed up this code,
it could help me advocate Haskell in the future.
It's supposed to match movie titles from an imported database to a
reference database. The version I've sent doesn't do anything very smart
- it's just doing literal title matches. The first argument to the
program is the filename of the table to be imported, and the second is
the filename of the reference table. The first line of each table is a
pipe-separated list of field names; the rest of the lines are records,
each a pipe-separated list of values.
The import files each have 3,000 records, and the reference table has
137,986 records.
Building the hash tables out of the files is quick - it just takes a few
seconds. But doing the matching of title_id in one table to title_id in
the other, in a nested loop between both tables, takes way too long.
It's matching two import titles (against each of the reference titles)
per second. It needs to do at least 20 per second to qualify for the
contest, and it's not doing anything fancy yet.
I tried various "improvements" to speed it up. One was to specifically
use ByteString, eliminating the AbsString class. Didn't make a
difference. Another was to use arrays instead of lists to store each
record, and precompute the indices of each of the fields within those
records. I also iterated over a list of keys instead of the list of
Maps, and only converted each record to a Map one at a time, hoping they
would be disposed of sooner. Instead of speeding up the program, this
slowed it down by a factor of 20!
I've profiled it, and I can't make much out of that. It seemed to be
spending 25% of its time doing scoring, and I though the problem must be
due to laziness, but I'm not sure.
So if anyone has any ideas how to speed this up by a factor of at least
10 times, it would be really appreciated! Even the Ruby solutions are
doing that, which is embarrassing.
Thanks,
Lyle
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AbsString.hs
Type: text/x-haskell
Size: 2370 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20060926/04f408b1/AbsString.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TextTable.hs
Type: text/x-haskell
Size: 2769 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20060926/04f408b1/TextTable.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: title_match_orig.hs
Type: text/x-haskell
Size: 2336 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20060926/04f408b1/title_match_orig.bin
More information about the Haskell-Cafe
mailing list