[Haskell-cafe] A tale of three shootout entries
s.clover at gmail.com
Mon Nov 26 23:11:11 EST 2007
In some spare time over the holidays I cooked up three shootout
entries, for Fasta, the Meteor Contest, and Reverse Complement. I
should probably have tossed them to haskell-cafe before submission,
for review and ideas, but they're up now. In any case, all three were
great learning experiences, but could use some other eyes and ideas
to be the best that they can.
First up is the meteor-contest entry.
This is the clear win of the bunch, with significantly improved time
thanks to its translation of the better algorithm from Clean.
However, it's still missing something. Nearly all its time is spent
in a tight loop in the solveCell function near the end of the code. I
tried unboxing this, but failed: the loop spends the bulk of its
time applying a bitwise AND between the recursively passed value and
a piece of data retrieved from the masksAtCell data structure, which
has type Array (Row,Col) (Array Color [Mask]). (Note that Mask,
Color, Row, and Col are all type synonyms for Int, added for
readability.) Since each Mask is stored in a list, the masks can't
easily be unboxed -- some sort of custom data structure built using
the FFI seems in order here. If anyone wants to tackle this, I think
it could be a big win for performance.
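Short of going through the FFI, one direction would be flattening each cell's mask list into a single unboxed array plus an offset table, so the inner loop reads unboxed Ints rather than walking boxed list cells. A minimal sketch of the idea, using Data.Array.Unboxed -- flatten and masksFor are hypothetical helpers, not part of the submitted entry:

```haskell
import Data.Array.Unboxed (UArray, listArray, (!))

type Mask = Int

-- Hypothetical: pack a list of mask lists into one unboxed array,
-- with an offset table marking where each group begins.
flatten :: [[Mask]] -> (UArray Int Mask, UArray Int Int)
flatten xss = ( listArray (0, total - 1) (concat xss)
              , listArray (0, length xss) offsets )
  where
    offsets = scanl (+) 0 (map length xss)
    total   = sum (map length xss)

-- Masks for group i occupy indices [offs ! i .. offs ! (i+1) - 1];
-- a tight loop could index the unboxed array directly instead of
-- materializing this list.
masksFor :: (UArray Int Mask, UArray Int Int) -> Int -> [Mask]
masksFor (masks, offs) i =
  [ masks ! j | j <- [ offs ! i .. offs ! (i + 1) - 1 ] ]
```

The same packing could be applied per (Row,Col,Color) bucket of masksAtCell, trading the nested boxed arrays for index arithmetic.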
Next is reverse-complement.
This *would* be a big win, except that I foolishly doubled memory
usage relative to the previous entry by filtering newlines
explicitly -- which does, one should note, provide a large
performance gain. The solution here seems obvious: roll the newline
stripping into the destructive modifications performed in the
revcomp function, as the winning C++ entry does. I'll probably get
around to this eventually, but if someone else wants to implement
this or any other performance improvement, please jump right in.
There might also be other tricks for reducing memory usage that
escape me at the moment (noting, of course, that since we use a
strict bytestring, as we should, it's unavoidable that we hold the
entire contents of the input at once... I think?)
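For reference, the fused traversal can at least be sketched purely, without the destructive writes: skip newlines, complement, and reverse in one left fold, so no intermediate newline-free copy is ever built. This is only an illustration of the fusion, not the in-place version the C++ entry uses, and the comp table here is deliberately abbreviated:

```haskell
import qualified Data.ByteString.Char8 as B

-- Abbreviated complement table; the real entry covers the full
-- IUPAC alphabet.
comp :: Char -> Char
comp 'A' = 'T'; comp 'a' = 'T'
comp 'C' = 'G'; comp 'c' = 'G'
comp 'G' = 'C'; comp 'g' = 'C'
comp 'T' = 'A'; comp 't' = 'A'
comp c   = c

-- One pass over the input: drop newlines and complement as we go;
-- consing onto the accumulator reverses the sequence for free.
revcomp :: B.ByteString -> B.ByteString
revcomp = B.pack . B.foldl' step []
  where
    step acc '\n' = acc
    step acc c    = comp c : acc
```

The destructive variant would do the same skip-and-complement logic with two cursors walking the buffer from both ends.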
Finally, there's fasta.
This one really depresses me. It outperforms the previous version by
roughly 20% on my machine (PPC) but underperforms by roughly the
same amount on the shootout box. If you compare it to dons' previous
version, the "optimizations" I attempted should be pretty obvious.
First, I precompute the partial sums for the frequency tables and
alter the choose function accordingly -- a pretty basic measure that
all the better entries seem to take. Next, I unboxed the random
function, which yielded big speedups. However, given that we use an
unfoldrN, as we should, I couldn't very well pass it a function that
returned something of kind # (especially as Maybe, which unfoldrN
uses, is of kind * -> *). Thus, I hid the unrolled loop in a lazy
list of floats that is passed instead of a random seed. I suspect
that, for some reason I don't understand -- differences in
processors, GHC's internal handling of floating-point math, or who
knows what -- this is somehow the source of the slowdown. If someone
with an Intel Pentium 4 machine comparable to the shootout box wants
to take a look at this code and see why it underperforms, I'd be
much obliged. It really seems to me that GHC's fasta performance is
far below where it should be (4x slower than Java!), and I'd like to
get its numbers up somehow.
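To make the two changes concrete, here is a rough sketch with hypothetical names (randoms, cumulate, choose -- the entry's actual definitions may differ): the benchmark's specified linear congruential generator (IM = 139968, IA = 3877, IC = 29573) hidden in a lazy list of floats, and a choose that scans precomputed partial sums instead of re-summing the frequency table on every call:

```haskell
-- Lazy list of floats in [0,1) from the shootout's specified LCG,
-- passed around in place of an explicit seed.
randoms :: Int -> [Float]
randoms seed = fromIntegral s / fromIntegral im : randoms s
  where
    im = 139968
    s  = (seed * 3877 + 29573) `rem` im

-- Turn per-symbol frequencies into cumulative partial sums, once.
cumulate :: [(Char, Float)] -> [(Char, Float)]
cumulate = go 0
  where
    go _   []            = []
    go acc ((c, p) : ps) = (c, acc + p) : go (acc + p) ps

-- A single scan against the cumulative table picks the symbol.
choose :: [(Char, Float)] -> Float -> Char
choose [(c, _)]      _ = c            -- last entry absorbs rounding
choose ((c, cp) : t) r
  | r <= cp   = c
  | otherwise = choose t r
choose [] _ = error "choose: empty table"
```

Whether the lazy float list itself is what hurts on the P4 -- boxing the Floats back up, or the x87 versus PPC floating-point paths -- is exactly the open question above.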
p.s. It looks like they've deprecated chameneos in favor of a new
version, chameneos-redux. As this was one of the places Haskell
really rocked the competition, it would probably be worth updating
the Haskell entry for the new benchmark. The n-body benchmark also
seems like one that could be much improved.