[Haskell-cafe] [16/16] SBM: Discussion and Conclusion

Peter Firefly Brodersen Lund firefly at vax64.dk
Sat Dec 22 04:17:30 EST 2007


General
-------
Bytestrings are faster than Haskell strings.  This is good and expected.
But their memory use, ouch!  There seems to be a grave memory bug in ghc 6.9.

Lazy bytestrings are slower than strict bytestrings -- this was completely
unexpected to me.

The strictness analyzer works *very* well.  It rarely made a difference to mark
arguments as strict (and when it did, the annotation was very easy to add).
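
For illustration, a minimal sketch of the kind of annotation involved (not
one of the actual benchmark programs):

  {-# LANGUAGE BangPatterns #-}
  import qualified Data.ByteString.Char8 as B

  -- Count spaces with an explicitly strict accumulator.  GHC's
  -- strictness analyzer usually infers the bang on 'acc' by itself,
  -- which is why adding it rarely made a measurable difference.
  countSpaces :: B.ByteString -> Int
  countSpaces = go 0
    where
      go !acc s = case B.uncons s of
        Nothing        -> acc
        Just (c, rest) -> go (if c == ' ' then acc + 1 else acc) rest

  main :: IO ()
  main = B.getContents >>= print . countSpaces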


I/O
---
The I/O patterns seen in the various programs are:
 1) strict bytestrings = read increasingly large chunks (they double every
    time; a sketch of the pattern follows at the end of this section).  I
    don't know if the already-read data have to be copied around or if the
    memory allocator can handle extending existing blocks (if they are at the
    "front" of the heap, say) like realloc() can.

    Even if the blocks do get copied around, that is not where the speed
    limit is.  It does seem a bit unwise in terms of memory pressure, though.

    Does the gc kick in every time the heap has to expand?  Does it do a full
    collection then?  If no allocations other than the one that caused the
    heap to expand have happened since the last collection, then perhaps the
    collection could be skipped?  (or some similar criterion)

 2) lazy bytestrings = read blocks of 32K-8 bytes, as expected.  The C
    benchmarks show that there's no penalty for reading 32K-8 vs. 32K.

 3) native strings = read blocks of 8K.

The C benchmarks show that it barely matters whether the block size is 4K, 32K,
or 32K-8 bytes.  In any case, the small differences due to block size are
completely in the noise.  Reading very large blocks, though, as the strict
bytestrings do, actually has a cost (38-70% slower, depending on the CPU and
kernel and RAM and bus speeds, etc -- see email 10 in the series).  It is still
peanuts compared to the total runtime for almost all the benchmarks.
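
To make the doubling pattern from (1) concrete, here is a rough Haskell
sketch of the idea -- an illustration of the I/O pattern only, not
bytestring's actual implementation.  Whether reallocBytes grows the block in
place or copies it is exactly the open question above:

  import Foreign.Marshal.Alloc (mallocBytes, reallocBytes, free)
  import Foreign.Ptr (Ptr, plusPtr)
  import Data.Word (Word8)
  import System.IO (hGetBuf, stdin)

  -- Read all of stdin, doubling the buffer whenever it fills up.
  readAll :: IO (Ptr Word8, Int)
  readAll = do
      buf <- mallocBytes initial
      go buf initial 0
    where
      initial = 32 * 1024
      go buf cap used = do
        n <- hGetBuf stdin (buf `plusPtr` used) (cap - used)
        let used' = used + n
        if n < cap - used                -- short read means EOF
          then return (buf, used')
          else do buf' <- reallocBytes buf (cap * 2)  -- may move the data
                  go buf' (cap * 2) used'

  main :: IO ()
  main = do
    (buf, len) <- readAll
    print len
    free buf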


Backend
-------
The backend turns out to be rather bad.  It is generally very easy to improve
the generated code by hand without doing anything in terms of sophisticated
analysis.  One can rewrite inner loops using heroic code (load four characters
at a time into a 32-bit register or even use MMX to handle eight characters in
parallel), but that isn't really necessary to get a good improvement and start
being competitive with C.

The backend is probably what makes some of the lazy bytestring and native
string code costly: there seem to be too many tests of whether the buffer has
been exhausted (I think).  Some simple hoisting and common subexpression
elimination would go far here.
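
By analogy, at the source level the difference looks like this (illustrative
only -- the real issue is in the generated code, not in anybody's source):
the buffer length is read once, outside the loop, instead of the exhaustion
test being re-derived on every iteration.

  {-# LANGUAGE BangPatterns #-}
  import qualified Data.ByteString as B
  import Data.ByteString.Unsafe (unsafeIndex)

  -- Count zero bytes; the bounds test against 'len' is the only
  -- per-iteration check left in the loop.
  countZeros :: B.ByteString -> Int
  countZeros s = go 0 0
    where
      len = B.length s                 -- hoisted out of the loop
      go !i !acc
        | i >= len  = acc
        | otherwise = go (i + 1)
                         (if unsafeIndex s i == 0 then acc + 1 else acc)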

For the simpler strict bytestring benchmarks, heroic backend optimizations are
not necessary: sensible I/O and buffer management buy more than going beyond
merely competent register use does (cache effects start to matter once the
compiler generates sensible code).

I would have liked to examine backend performance further by playing with the
generated C-- code, but unfortunately the C-- that ghc emits won't compile.

Dissecting the generated machine code for the more complicated benchmarks was
so hard that I skipped it -- I can't even recognize the labels!  It's one thing
if the backend generates lots of extra labels but another if the expected
labels simply are not there!  (This is why tools/cut.pl complains that it can't
find the main loop for some of the programs -- if the */*.discut file is
unexpectedly short, that's the reason.)


Quality of measurements
-----------------------
The speed measurements are of extremely high quality -- much higher, it turns
out, than is needed at the moment.  I probably spent a bit too much effort there.

The memory measurements are unusual.  I don't recall having seen anything
like them elsewhere, because one seldom has to work with uncooperative programs
when benchmarking (the preloading trick works with any program, as long as it
is dynamically linked on Linux and uses the standard loader).  They are
probably the biggest single contribution from this work, together with the
visualization provided by the bar charts.


Thoroughness
------------
It really is necessary to benchmark more than one ghc/library combination;
otherwise the bad memory bugs wouldn't have stood out so clearly, or I would
have believed, as Don Stewart did, that he had fixed the memory allocation bug.

It also matters which microarchitectures one uses, as pointed out by email 10
in this series.

And you have to have many benchmarks, otherwise you might not really be
measuring what you think you are measuring.  I could probably have gotten away
with fewer benchmarks in this case, but that would have left a nagging feeling
at the back of my head that there was a benchmark I /could/ have included that
would have shown up a spectacular performance bug.

Using such big input files (150MB) made the programs run long enough that the
timer granularity (1ms) was small compared to their runtime.  It also made the
memory performance problems really stand out (and be impossible to ignore).  I
think that was a good idea.

Having automatic feedback on the quality of the timing measurements helped a
lot during the development of the testing harness because it made it easier and
faster for me to eliminate the causes of imprecision.  That's why I wrote the
eatmem tool, and that's why the test harness does a dd of the input file and
the compiled program before running the tests.  It should probably also dd the
C library (and the few other libraries linked in: math, GNU multiprecision, and
TLS stuff), but the quality estimators show that any improvement would be very
small.
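
For reference, my guess at the core of an eatmem-like tool is something
along these lines (the real eatmem may well differ):

  import Foreign.Marshal.Alloc (mallocBytes, free)
  import Foreign.Storable (pokeByteOff)
  import Foreign.Ptr (Ptr)
  import Data.Word (Word8)
  import Control.Monad (forM_)
  import System.Environment (getArgs)

  main :: IO ()
  main = do
    [mbs] <- getArgs                   -- megabytes to eat
    let bytes = read mbs * 1024 * 1024
    buf <- mallocBytes bytes :: IO (Ptr Word8)
    -- Touch one byte per 4K page: merely allocating is not enough,
    -- since untouched pages need not be backed by physical memory.
    forM_ [0, 4096 .. bytes - 1] $ \off ->
      pokeByteOff buf off (0 :: Word8)
    free buf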

Using top together with huge input files convinced me that -sstderr was
untrustworthy, so I developed the pause-at-end preloading hack.  Paranoia paid
off big time there.


Visual presentation
-------------------
The bar charts (with UTF-8) turned out to be so much more readable than raw
numbers.  This is a trick I learned from the Cairo team, who created a
performance suite very early on.  The bar is drawn with '%' if the measurement
is bad.  That turned out to be a much clearer visual indicator than my previous
'*' and '!' flag characters.  Really, there's no good reason not to use UTF-8
these days.
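
Something in the spirit of the report renderer, with invented names and
numbers, just to show the idea:

  -- Render one bar per benchmark, drawn with '%' instead of a block
  -- character when the measurement quality is bad.
  bar :: Bool -> Double -> Double -> String
  bar bad maxVal v = replicate n (if bad then '%' else '█')
    where n = round (40 * v / maxVal)  -- scale to at most 40 columns

  main :: IO ()
  main = mapM_ putStrLn
    [ take 14 (name ++ repeat ' ') ++ bar bad 2.0 secs
    | (name, secs, bad) <- [ ("space-fast",   0.6, False)
                           , ("space-medium", 1.1, False)
                           , ("space-slow",   1.9, True) ] ]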

I also discovered that the program names used for the benchmarks really matter.
I had started out with names like 'cntspaces12' and 'cntbytes34' and kept
getting confused about which program measured what.  If the names are good, you
don't need to put a description of each benchmark in the report.

The order in which the benchmarks are presented also matters.  It is important
that they are grouped, and it is also important that the order is as stable as
possible, even when not all runs contain exactly the same benchmarks.

Separating the time and memory bar charts, and separating the byte- and
space-counting benchmarks, also turned out to be useful.  I had neither at
first.

And finally, merging of reports.  I had no idea it would be so useful before I
wrote the scripts for it.  This is what makes the memory bugs in 6.9 really
stand out and this is what shows the improvements of bytestring 0.9.0.2 (in
particular with ghc 6.8.2).  It also shows how wildly the performance of four
different microarchitectures can differ.


Scripting
---------
Scripting everything is key to the success of an effort like this.  I am VERY
happy that I scripted not just the running of the benchmarks but also the
report handling and even the installation of the ghc/bytestring combinations.

(I also scripted the sending of these 17 emails -- let's see if the script
works :) )


The predictions elicited by my teaser email
-------------------------------------------
As expected, predicting the performance of Haskell programs is really hard,
even modulo performance bugs.

And, no, strings are not "silly".


Action Items
------------
Suggested short-term actions:
 o Can the allocator expand blocks?  If not, see if it can be implemented.
 o Find out why lazy bytestrings "by hand" are that slow -- is it due to a
   lack of hoisting in the backend?
 o Improve the backend, mainly by not hiding information from the register
   allocator.  Alternatively, use an assembly-to-assembly transformer that
   improves the register allocation.
 o Fix -sstderr.
 o Perhaps even let the RTS print out /proc/self/maps or /proc/self/status
   or the "peak RSS" returned by getrusage() in the ru_maxrss field?  (A
   user-land sketch of the /proc idea follows this list.)
 o Find memory leaks in ghc 6.9 code.
 o ghc should allow round-tripping of the C-- code it generates.
 o It would be awfully nice if the backend didn't sometimes throw away the
   recognizable labels.
 o Can the core/stg code be made easier to read?
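
Regarding the /proc/self/status item above: a user-land sketch of what the
RTS could print at exit (VmPeak/VmHWM are the Linux field names for peak
virtual size and peak RSS; this is not an existing RTS feature):

  import Data.List (isPrefixOf)

  -- Print the kernel's peak-memory figures for the current process.
  main :: IO ()
  main = do
    status <- readFile "/proc/self/status"
    mapM_ putStrLn [ l | l <- lines status
                       , any (`isPrefixOf` l) ["VmPeak", "VmHWM"] ]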

-Peter

