[Haskell-cafe] A tale of three shootout entries

Mon Nov 26 23:11:11 EST 2007

In some spare time over the holidays I cooked up three shootout  
entries, for Fasta, the Meteor Contest, and Reverse Complement. I  
should probably have tossed them to haskell-cafe before submission,  
for review and ideas, but they're up now. In any case, all three were  
great learning experiences, but could use some other eyes and ideas  
to be the best that they can.

First up is the meteor-contest entry.

http://shootout.alioth.debian.org/gp4/benchmark.php? 
test=meteor&lang=ghc&id=5

This is the clear win of the bunch, with significantly improved time  
thanks to its translation of the better algorithm from Clean.  
However, it's still missing something. Nearly all its time is spent  
in a tight loop in the solveCell function near the end of the code. I  
tried unboxing this, but failed because it spends the bulk of its  
time applying a bitwise and between the recursively passed value and  
a piece of data retrieved from the masksAtCell data structure which  
is of type :: Array (Row,Col) (Array Color [Mask]). (Note that Mask,  
Color, Row and Col are all type synonyms for Int that I added for  
readability).  As each Mask is stored in this list, the masks can't  
easily be unboxed -- some sort of custom data structure built using  
the FFI seems in order here. If anyone wants to tackle this, I think  
it could be a big win for performance.

Next is reverse-complement.

http://shootout.alioth.debian.org/gp4/benchmark.php? 
test=revcomp&lang=ghc&id=3

This *would* be a big win except I dimly doubled memory usage from  
the previous entry due to filtering newlines explicitly -- which  
does, one should note, provide a large performance gain. The solution  
here seems the most obvious -- roll the newline stripping into the  
destructive modifications performed in the revcomp function, as the  
winning C++ entry does. I'll probably get around to this eventually,  
but if someone else wants to try to implement this or any other  
performance improvements, please jump right in. Additionally, there  
might be some other tricks to reducing its memory usage that escape  
me at the moment (noting, of course, that using a strict bytestring,  
as we should, its unavoidable that we consume the entire contents of  
the input at once... I think?)

Finally, there's fasta.

http://shootout.alioth.debian.org/gp4/benchmark.php? 
test=fasta&lang=ghc&id=2

This one really depresses me. It outperforms the previous version by  
roughly 20% on my machine (PPC) but underperforms by roughly the same  
amount on the shootout box. If you compare it to dons previous  
version, the "optimizations" I attempted should be pretty obvious.  
First, I precompute the partial sums for the frequency tables and  
alter the choose function accordingly. This is a pretty basic measure  
that all the better entries seem to do. Next, I unboxed the random  
function, which yielded big speedups However, given that we use an  
unfoldN, as we should, I couldn't very well pass it a function that  
returned something of kind #, (especially as Maybe, which unfoldN  
uses, is of kind *->*). Thus, I hid the unrolled loop in a lazy list  
of floats that is passed instead of a random seed. I suspect that for  
some reason I don't understand relating to differences in processors,  
GHC's internal handling of floating point math, or who knows what,  
this somehow is the source of the slowdown. If someone with an Intel  
Pentium 4 machine comparable to that of the shootout box wants to  
take a look at this code and see why it underperforms, I'd be much  
obliged. It really seems to me that GHC's fasta performance is far  
below where it should be (4x slower than Java!) , and I'd like to get  
its numbers up somehow.

Thanks,
Sterl

p.s. It looks like they've depreciated chameneos in favor of a new  
version, chameneos-redux. As this was one of the places Haskell  
really rocked the competition, it would probably be worth updating  
the Haskell entry for the new benchmark. Also, the n-bodies benchmark  
seems like another that could be much improved.