[Haskell-cafe] Re: proposal: HaBench, a Haskell Benchmark Suite

Andy Georges andy.georges at elis.ugent.be
Fri Jun 25 09:01:25 EDT 2010


Hi Simon et al,

On Jun 25, 2010, at 14:39, Simon Marlow wrote:

> On 25/06/2010 00:24, Andy Georges wrote:
> 
>> <snip> 
>> Are there any inputs available that allow the real part of the suite
>> to run for a sufficiently long time? We're going to use criterion in
>> any case given our own expertise with rigorous benchmarking [3,4],
>> but since we've made a case in the past against short running apps on
>> managed runtime systems [5], we'd love to have stuff that runs at
>> least in the order of seconds, while doing useful things. All
>> pointers are much appreciated.
> 
> The short answer is no, although some of the benchmarks have tunable input sizes (mainly the spectral ones) and you can 'make mode=slow' to run those with larger inputs.
> 
> More generally, the nofib suite really needs an overhaul or replacement.  Unfortunately it's a tiresome job and nobody really wants to do it. There have been various abortive efforts, including nobench and HaBench.  Meanwhile we in the GHC camp continue to use nofib, mainly because we have some tool infrastructure set up to digest the results (nofib-analyse).  Unfortunately nofib has steadily degraded in usefulness over time due to both faster processors and improvements in GHC, such that most of the programs now run for less than 0.1s and are ignored by the tools when calculating averages over the suite.

Right. I have the distinct feeling this is a major gap in the Haskell world. SPEC evolved over time to include larger benchmarks that still exercise the various parts of the hardware, so that a benchmark does not suddenly show a large improvement on a new architecture/implementation simply because, e.g., a larger cache lets the working set stay resident for the entire execution. The Haskell community has nothing that remotely resembles a decent suite. You could do experiments and show that over 10K iterations the average execution time per iteration goes from 500ms to 450ms, but what does this really mean?
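Just to sketch what I mean (untested, and the function names and sizes are only placeholders): timing each iteration separately, rather than only reporting a grand mean over thousands of iterations, at least makes warm-up and cache effects visible:

    import System.CPUTime (getCPUTime)
    import Control.DeepSeq (force)
    import Control.Exception (evaluate)

    -- Time one evaluation of 'f input', in seconds (getCPUTime is in picoseconds).
    timeIteration :: ([Int] -> Int) -> [Int] -> IO Double
    timeIteration f input = do
      start <- getCPUTime
      _ <- evaluate (force (f input))
      end <- getCPUTime
      return (fromIntegral (end - start) / 1e12)

    main :: IO ()
    main = do
      let input = [1 .. 1000000] :: [Int]
      -- 100 iterations is arbitrary here; the point is to look at the
      -- whole distribution of per-iteration times, not just the mean.
      times <- mapM (\_ -> timeIteration sum input) [1 .. 100 :: Int]
      mapM_ print times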

> We have a need not just for plain Haskell benchmarks, but benchmarks that test
> 
> - GHC extensions, so we can catch regressions
> - parallelism (see nofib/parallel)
> - concurrency (see nofib/smp)
> - the garbage collector (see nofib/gc)
> 
> I tend to like quantity over quality: it's very common to get just one benchmark in the whole suite that shows a regression or exercises a particular corner of the compiler or runtime.  We should only keep benchmarks that have a tunable input size, however.

I would suggest that the first category can be made up of microbenchmarks, as I do not think it is really needed for performance measurement per se. The other categories, however, really need long-running benchmarks that (preferably) use heaps of RAM, even when they are well tuned. Something along the lines of the sketch below is what I have in mind.
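A rough, untested sketch (the sizes and names are made up) of a heap-heavy benchmark whose input size is tunable from the command line, so it can be scaled up as machines get faster:

    import System.Environment (getArgs)
    import qualified Data.Map as M

    main :: IO ()
    main = do
      -- problem size from the command line, e.g. "5000000"
      [n] <- getArgs
      let size = read n :: Int
          -- build a large map so the working set exceeds the caches
          m = M.fromList [ (i, i * i) | i <- [1 .. size] ]
          -- do enough lookups that the run lasts seconds, not milliseconds
          total = sum [ M.findWithDefault 0 (i `mod` size + 1) m
                      | i <- [1 .. 10 * size] ]
      print total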

> Criterion works best on programs that run for short periods of time, because it runs the benchmark at least 100 times, whereas for exercising the GC we really need programs that run for several seconds.  I'm not sure how best to resolve this conflict.

I'm not sure about this. Given that there is quite a bit of non-determinism in modern CPUs and that computer systems seem to behave chaotically [1], I definitely see the need to employ Criterion for longer-running applications as well. It might not need 100 executions, or multiple iterations per execution (incidentally, can those iterations be said to be independent?), but somewhere around 20 - 30 seems to be a minimum.
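To illustrate what I mean by 20 - 30 executions (untested sketch, "./bench" and its argument are placeholders for an actual benchmark binary): run the program as an independent process each time and look at the distribution of wall-clock times, rather than iterating inside a single process:

    import Data.Time.Clock (getCurrentTime, diffUTCTime)
    import System.Process (callProcess)

    -- Run one full execution of the benchmark and return wall-clock seconds.
    timedRun :: FilePath -> [String] -> IO Double
    timedRun exe args = do
      start <- getCurrentTime
      callProcess exe args
      end <- getCurrentTime
      return (realToFrac (diffUTCTime end start))

    main :: IO ()
    main = do
      times <- mapM (\_ -> timedRun "./bench" ["5000000"]) [1 .. 30 :: Int]
      mapM_ print times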

> 
> Meanwhile, I've been collecting pointers to interesting programs that cross my radar, in anticipation of waking up with an unexpectedly free week in which to pull together a benchmark suite... clearly overoptimistic!  But I'll happily pass these pointers on to anyone with the inclination to do it.


I'm definitely interested. If I want to make a strong case for my current research, I really need benchmarks that can be used. Additionally, coming up with a good suite and characterising it can easily result in a decent paper that is certain to be cited numerous times. I think it would have to be a group/community effort though. I've looked through the apps on the Haskell wiki pages, but there's not much usable there, imho. I'd like to illustrate this with the DaCapo benchmark suite [2,3] as an example. It took a while, but now everybody in the Java camp is (or should be) using these benchmarks. Saying that we simply do not want to do this is not a tenable position.


-- Andy


[1] Computer Systems Are Dynamical Systems, Todd Mytkowicz, Amer Diwan, and Elizabeth Bradley, Chaos 19, 033124 (2009); doi:10.1063/1.3187791 (14 pages).
[2] The DaCapo Benchmarks: Java Benchmarking Development and Analysis, Stephen Blackburn et al., OOPSLA 2006.
[3] Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century, Stephen Blackburn et al., CACM 2008.
