[Haskell-cafe] Optimizing a title matcher

Tue Sep 26 17:22:01 EDT 2006

On Tuesday 26 September 2006 16:44, Lyle Kopnicky wrote:
> Hi folks,
>
> I'm competing in a contest at work, and we're allowed to use whatever
> language we want. I decided this was my chance to prove to people that
> Haskell was up to the challenge. Unfortunately, I ran into performance
> problems. Since the contest ends this Friday, I've decided to switch to
> C++ (gasp!). But if any of you have advice on how to speed up this code,
> it could help me advocate Haskell in the future.
>
> It's supposed to match movie titles from an imported database to a
> reference database. The version I've sent doesn't do anything very smart
> - it's just doing literal title matches. The first argument to the
> program is the filename of the table to be imported, and the second is
> the filename of the reference table. The first line of each table is a
> pipe-separated list of field names; the rest of the lines are records,
> each a pipe-separated list of values.
>
> The import files each have 3,000 records, and the reference table has
> 137,986 records.
>
> Building the hash tables out of the files is quick - it just takes a few
> seconds. But doing the matching of title_id in one table to title_id in
> the other, in a nested loop between both tables, takes way too long.
> It's matching two import titles (against each of the reference titles)
> per second. It needs to do at least 20 per second to qualify for the
> contest, and it's not doing anything fancy yet.

Humm... well, double nested loops seems like the wrong approach.  Also, if you 
are using GHC, it's hashtable implementation has farily well-known 
performance problems. If all you care about is exact matching, then the 
operation is essentially a finite map intersection (if I've understood what 
you are trying to do).

This is just a guess, but I suspect you will probably get much better 
performance (and better-looking code!) by just using Data.Map.intersection

http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Map.html#v%3Aintersection

Alternately, there is the ternary trie implementation from Edison 
(http://www.eecs.tufts.edu/~rdocki01) that may also work for you.

If you need to do prefix matching, then a trie is the way to go.  You can 
probably code up a nice prefix-intersection operation using tries that should 
go pretty fast.

If you have some other metric other than prefix in mind for partial matches, 
then things probably get a lot more complicated.  You're probably looking at 
calculating minimum distances in some feature-space, which calls for pretty 
sophisticated algorithms if you need good performance.

> I tried various "improvements" to speed it up. One was to specifically
> use ByteString, eliminating the AbsString class. Didn't make a
> difference. Another was to use arrays instead of lists to store each
> record, and precompute the indices of each of the fields within those
> records. I also iterated over a list of keys instead of the list of
> Maps, and only converted each record to a Map one at a time, hoping they
> would be disposed of sooner. Instead of speeding up the program, this
> slowed it down by a factor of 20!
>
> I've profiled it, and I can't make much out of that. It seemed to be
> spending 25% of its time doing scoring, and I though the problem must be
> due to laziness, but I'm not sure.
>
> So if anyone has any ideas how to speed this up by a factor of at least
> 10 times, it would be really appreciated! Even the Ruby solutions are
> doing that, which is embarrassing.
>
> Thanks,
> Lyle

-- 
Rob Dockins

Talk softly and drive a Sherman tank.
Laugh hard, it's a long way to the bank.
       -- TMBG