[Haskell-cafe] Optimizing a title matcher
robdockins at fastmail.fm
Tue Sep 26 17:22:01 EDT 2006
On Tuesday 26 September 2006 16:44, Lyle Kopnicky wrote:
> Hi folks,
> I'm competing in a contest at work, and we're allowed to use whatever
> language we want. I decided this was my chance to prove to people that
> Haskell was up to the challenge. Unfortunately, I ran into performance
> problems. Since the contest ends this Friday, I've decided to switch to
> C++ (gasp!). But if any of you have advice on how to speed up this code,
> it could help me advocate Haskell in the future.
> It's supposed to match movie titles from an imported database to a
> reference database. The version I've sent doesn't do anything very smart
> - it's just doing literal title matches. The first argument to the
> program is the filename of the table to be imported, and the second is
> the filename of the reference table. The first line of each table is a
> pipe-separated list of field names; the rest of the lines are records,
> each a pipe-separated list of values.
> The import files each have 3,000 records, and the reference table has
> 137,986 records.
> Building the hash tables out of the files is quick - it just takes a few
> seconds. But doing the matching of title_id in one table to title_id in
> the other, in a nested loop between both tables, takes way too long.
> It's matching two import titles (against each of the reference titles)
> per second. It needs to do at least 20 per second to qualify for the
> contest, and it's not doing anything fancy yet.
Humm... well, double nested loops seems like the wrong approach. Also, if you
are using GHC, it's hashtable implementation has farily well-known
performance problems. If all you care about is exact matching, then the
operation is essentially a finite map intersection (if I've understood what
you are trying to do).
This is just a guess, but I suspect you will probably get much better
performance (and better-looking code!) by just using Data.Map.intersection
Alternately, there is the ternary trie implementation from Edison
(http://www.eecs.tufts.edu/~rdocki01) that may also work for you.
If you need to do prefix matching, then a trie is the way to go. You can
probably code up a nice prefix-intersection operation using tries that should
go pretty fast.
If you have some other metric other than prefix in mind for partial matches,
then things probably get a lot more complicated. You're probably looking at
calculating minimum distances in some feature-space, which calls for pretty
sophisticated algorithms if you need good performance.
> I tried various "improvements" to speed it up. One was to specifically
> use ByteString, eliminating the AbsString class. Didn't make a
> difference. Another was to use arrays instead of lists to store each
> record, and precompute the indices of each of the fields within those
> records. I also iterated over a list of keys instead of the list of
> Maps, and only converted each record to a Map one at a time, hoping they
> would be disposed of sooner. Instead of speeding up the program, this
> slowed it down by a factor of 20!
> I've profiled it, and I can't make much out of that. It seemed to be
> spending 25% of its time doing scoring, and I though the problem must be
> due to laziness, but I'm not sure.
> So if anyone has any ideas how to speed this up by a factor of at least
> 10 times, it would be really appreciated! Even the Ruby solutions are
> doing that, which is embarrassing.
Talk softly and drive a Sherman tank.
Laugh hard, it's a long way to the bank.
More information about the Haskell-Cafe