[Haskell-cafe] [16/16] SBM: Discussion and Conclusion

Sat Dec 22 21:02:53 EST 2007

firefly:
> General
> -------
> Bytestrings are faster than haskell strings.  This is good and expected.
> But their memory use, ouch!  There seems to be a grave memory bug in 6.9.

Bug report -- since the bytestring library hasn't changed between 6.8.2
and head, that sounds like a bug report.

> Lazy bytestrings are slower than strict bytestrings -- this was completely
> unexpected for me.

Which programs should I look at? Is there a specific program or two
where lazy bytestring performance is entirely out of whack (against
6.8.2?)? If so, please forward it to me and Duncan.

> 
> I/O
> ---
> The I/O patterns seen in the various programs are:
>  1) strict bytestrings = read increasingly large chunks (they double every
>     time).  I don't know if the already-read data have to be copied around or
>     if the memory allocator can handle extending existing blocks (if they are
>     at the "front" of the heap, say) like realloc() can.

if you're reading from stdin, and we don't know the size, bytestring
getContents uses realloc.

>     Even if the blocks do get copied around, that's not where the performance
>     limiter is (speed-wise).  It does seem to be a bit unwise in terms of
>     memory pressure.
> 
>     Does the gc kick in every time the heap has to expand?  Does it do a full
>     collection then?  If no other allocations have happened since the last
>     collection than the allocation that caused the heap to expand, then perhaps
>     skip the collection?  (or some criteria similar to that)
> 
>  2) lazy bytestrings = read blocks of 32K-8 bytes, as expected.  The C
>     benchmarks show that there's no penalty for reading 32K-8 vs. 32K.
> 
>  3) native strings = read blocks 8K.
> 
> The C benchmarks show that it barely matters if the block size is 4K, 32K, or
> 32K-8 bytes.  In any case, the small differences due to block size are
> completely in the noise.  Reading very large blocks, though, as the strict
> bytestrings do, actually has a cost (38-70% slower, depending on the CPU and
> kernel and RAM and busspeeds, etc -- see email 10 in the series).  It is still
> peanets compared for almost all the benchmarks.

Useful information, thanks.

> Backend
> -------
> The backend turns out to be rather bad.  It is generally very easy to improve
> the generated code by hand without doing anything in terms of sophisticated
> analysis.  One can rewrite inner loops using heroic code (load four characters
> at a time together into a 32-bit register or even use MMX to handle eight
> characters in parallel) but it isn't really necessary to do that to gain a
> good improvement and start being competitive with C.
> 
> The backend is probably what is costly on some of the lazy bytestring and
> native string code because there are too many tests to see if the buffer has
> been exhausted (I think).  Some simple hoisting and common subexpression
> elimination would go far here.

There may be things we can do at the library level too. We changed the
lazy bytestring representation in the last release cycle to remove a
number of tests. If you've a specific example would help though.

> Thoroughness
> ------------
> It really is necessary to benchmark more than one ghc/library combination,
> otherwise the bad memory bugs wouldn't stand out so clearly or I would have
> believed, as Don Stewart did, that he had fixed the memory allocation bug.

I fixed a particular issue with the library. But please make a bug
report if there are other issues wrt. ghc head in particularly.

> Action Items
> ------------
> Suggested short-term actions:
>  o Can the allocator expand blocks?  If not, see if it can be implemented.
>  o Find out why lazy bytestrings "by hand" are that slow - is it due to
>    lack of hoisting in the backend?

Example please!

>  o Improve backend, mainly by not hiding information from register allocator.
>    Alternatively, by using an assembly-to-assembly transformator that
>    improves the register allocation.

Some nice examples that could help motivate the native gen hackers would
be useful.

>  o Fix -sstderr.
>  o perhaps even let the RTS print out /proc/self/maps or /proc/self/status
>    or "peak RSS" returned by getrusage() in the ru_maxrss field?
>  o Find memory leaks in ghc 6.9 code.

Bug report time.

>  o ghc should allow round-tripping of the C-- code it generates.
>  o would be awfully nice if the backend didn't sometimes throw the recognizable
>    labels away.
>  o can the core/stg code be made easier to read?

Can you ensure these points are summarised and directed to the ghc bug
tracker,

    http://hackage.haskell.org/trac/ghc/newticket?type=bug

to this work isn't lost on -cafe at .

-- Don