[Haskell-cafe] Optimizing a title matcher

Tue Sep 26 16:47:15 EDT 2006

On 9/26/06, Lyle Kopnicky <lists at qseep.net> wrote:
> Hi folks,
>
> I'm competing in a contest at work, and we're allowed to use whatever
> language we want. I decided this was my chance to prove to people that
> Haskell was up to the challenge. Unfortunately, I ran into performance
> problems. Since the contest ends this Friday, I've decided to switch to
> C++ (gasp!). But if any of you have advice on how to speed up this code,
> it could help me advocate Haskell in the future.
>
> It's supposed to match movie titles from an imported database to a
> reference database. The version I've sent doesn't do anything very smart
> - it's just doing literal title matches. The first argument to the
> program is the filename of the table to be imported, and the second is
> the filename of the reference table. The first line of each table is a
> pipe-separated list of field names; the rest of the lines are records,
> each a pipe-separated list of values.
>
> The import files each have 3,000 records, and the reference table has
> 137,986 records.
>
> Building the hash tables out of the files is quick - it just takes a few
> seconds. But doing the matching of title_id in one table to title_id in
> the other, in a nested loop between both tables, takes way too long.
> It's matching two import titles (against each of the reference titles)
> per second. It needs to do at least 20 per second to qualify for the
> contest, and it's not doing anything fancy yet.
>
> I tried various "improvements" to speed it up. One was to specifically
> use ByteString, eliminating the AbsString class. Didn't make a
> difference. Another was to use arrays instead of lists to store each
> record, and precompute the indices of each of the fields within those
> records. I also iterated over a list of keys instead of the list of
> Maps, and only converted each record to a Map one at a time, hoping they
> would be disposed of sooner. Instead of speeding up the program, this
> slowed it down by a factor of 20!
>
> I've profiled it, and I can't make much out of that. It seemed to be
> spending 25% of its time doing scoring, and I though the problem must be
> due to laziness, but I'm not sure.
>
> So if anyone has any ideas how to speed this up by a factor of at least
> 10 times, it would be really appreciated! Even the Ruby solutions are
> doing that, which is embarrassing.

Do you have some test input online?

-- 
Cheers,
  Lemmih