ByteString I/O Performance

Mon Sep 3 04:56:07 EDT 2007

On Mon, 2007-09-03 at 01:40 +0200, Peter Simons wrote:
> Donald Bruce Stewart writes:
> 
>  > It's been a while since I benchmarked the IO performance, so
>  > looks like time to revisit this issue.
> 
> Hi Bruce,
> 
> my impression is that performance cannot be improved significantly
> without providing a modified input API. For some purposes reading
> input in large chunks is fine, but for some purposes it is not. A
> network proxy server, for example, cannot afford to buffer large
> amounts of data for every connection. An I/O buffer size of, say
> 256KB, would be really not a good choice for such a program. Something
> like 4KB typically is, but the small buffer size means that large
> amounts of data will be read using a fairly large number of hGet
> calls. As it is, hGet performs at least one malloc() per read() call.
> That will be slow, no matter how optimized that code is.
> 
> One way to get malloc() out of the picture would be to provide a
> variant of hGet that takes an existing, pre-allocated buffer as an
> argument, so that the user can allocate a ByteString once and re-use
> it for every single hGet and hPut.

Stefan is right, the only place that C malloc() is used is in strict
ByteString's hGetContents. Everywhere else we use GHC's pinned heap
allocation.

Strict ByteString ReadFile does not use hGetContents because in that
case we know the file length, it uses hGetBuf.

Also, createAndTrim does not do any copying when there is no trimming
necessary, so in the best case we do only a single copy using hGetBuf.

As I recall from when I profiled this for the ByteString paper, with a
lazy ByteString implementation of unix 'cat' on disk files (rather than
a network socket) we should only copy each chunk once (as far as I can
see). The slowdown compared to a simple hGetBuf 'cat' was all down to
cache locality, because we're cycling between a range of buffers rather
than a single cache-hot buffer. The time overhead of memory allocation
and GC is negligible. In the tests I did at the time I found that on
fully cached files the slowdown compared to using a single mutable
buffer was about 2-3x.

I figured that overhead is not bad, considering it represents the worst
case when no work is being done to transform the data in any way and
we're not doing any real IO, just copying data from kernel memory to
user space memory.

If we're doing lots of short reads (like when reading from sockets) then
there is an opportunity for improvement. We could read into an internal
buffer and cut off as an immutable ByteString only the chunk that got
filled. The remainder of the buffer could be used for the following
reads until the whole buffer is exhausted and a new buffer has to be
allocated. This is the buffering strategy we use in the binary library
when serialising.

Duncan