[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
Don Stewart
dons at galois.com
Sat Dec 15 22:07:17 EST 2007
firefly:
> What do you think the relative speeds are of the six small haskell
> programs at the end of this email?
>
> All they do is read from stdin and count the number of spaces they see.
> There are two that use strict bytestrings, two that use lazy
> bytestrings, and two that use the standard Haskell strings. Three use a
> recursive function with an accumulator parameter and three use a foldl
> with a lambda function.
>
> Say the fastest one takes the time 1. How much time will the others
> take?
>
> And how about memory? How much memory do you think they require? Let's
> say we feed a 150MB(*) file into each of them, how many megabytes do you
> think they end up using (as seen from the OS, not in terms of how big
> the live heap is)?
>
> I'm going to post full benchmarks + analysis on Wednesday.
>
> -Peter
>
> *) hardddisk megabytes. The file is 150000034 bytes ≈ 143 mebibytes.
>
>
> PS: For extra credit, what do you think is the peak memory use for this
> program when given an input file of 150MB?
Well, I'm not going to wait till Wednesday for the numbers!
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.8.2
$ du -hs 150M
150M
------------------------------------------------------------------------
Program 1:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as B
import GHC.Int (Int64)
-- note that D.BS.Lazy.Char8.length is ByteString -> Int64
-- D.BS.C8.length is ByteString -> Int
cnt :: B.ByteString -> Int64
cnt bs = B.length (B.filter (== ' ') bs)
main = do s <- B.getContents
print (cnt s)
Ok, so lazy bytestrings. Should be constant space use, but two traversals
(since there's no lazy bytestring fusion, over each lazy chunk). Not perfect,
but should be ok. length will consume chunks as they're produced by filter.
** Prediction:
Constant ~3M space use (runtime, plus small overhead)
Fast, due to chunk-wise processing.
** Result:
$ ghc -O2 A.hs -o A --make
$ time ./A < 150M +RTS -sstderr
./A +RTS -sstderr < 150M 1.01s user 0.10s system 98% cpu 1.123 total
And top says 40M allocated.
** Summary: Ok, pretty fast, but an unexpected(!) amount of memory allocated.
Now, this memory result is suspicious, I wonder if the now obsolete 'array fusion'
is messing things up. In Data.ByteString.Lazy, we have:
filter p = F.loopArr . F.loopL (F.filterEFL p) F.NoAcc
We keep meaning to replace this stuff with the stream fusion mechanisms, which
compile a lot better. Perhaps the time has come to look at that :)
I'll put this memory allocation down as a bug that needs to be looked at.
------------------------------------------------------------------------
Program 2:
hs/space-bs-c8-acc-1.hs:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as B
cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
then acc
else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
main = do s <- B.getContents
print (cnt 0 s)
Strict bytestrings, and you manually fuse the length/filter calculation.
Allocating all that memory will cost you.
** Prediction:
O(N) memory, around 150M allocated
Slower, due to cache effects (more data to traverse) and more indirections.
** Result
top says 154M
./B +RTS -sstderr < 150M 1.10s user 0.52s system 111% cpu 1.454 total
** Summary: Seems reasonable, when its doing all that allocation.
------------------------------------------------------------------------
Program 2a:
But we could easily improve this program: Since:
length (filter (== ' ')
==
length (filterByte ' '
==
count ' '
And we have:
import qualified Data.ByteString.Char8 as B
cnt :: B.ByteString -> Int
cnt x = B.count ' ' x
main = do s <- B.getContents
print (cnt s)
** Prediction: Which should be a bit faster.
** Result:
$ time ./B < 150M
24569024
./B < 150M 0.66s user 0.55s system 113% cpu 1.070 total
** Summary: So that's the fastest program so far.
The rewrite rules to do these transformatoins used to be enabled, but
need looking at again. There should also be no real benefit to manually fuse a
length . filter loop like this, however, the old fusion system used in
bytestring might have some small overhead here. This also needs looking at.
------------------------------------------------------------------------
Program 2b:
We can do even better if we read the file in:
import qualified Data.ByteString.Char8 as B
import System.IO.Posix.MMap
main = print . B.count ' ' =<< mmapFile "150M"
** Prediction: super fast
** Result:
$ time ./B2
24569024
./B2 0.31s user 0.01s system 101% cpu 0.314 total
(similar results if you use vanilla B.readFile too, fwiw).
Summary: This suggests to me we could look again at how strings of unknown size
are read in.
------------------------------------------------------------------------
Program 3:
hs/space-bslc8-acc-1.hs:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as B
cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
then acc
else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
main = do s <- B.getContents
print (cnt 0 s)
Ok, for lazy bytestrings.
** Prediction: Should run in constant space, but there are
more checks than in the unfused length/filter case.
It should be a bit slower, and hopefully run in constant space.
** Result:
$ time ./C < 150M
24569024
./C < 150M 2.36s user 0.11s system 99% cpu 2.489 total
top says 3804K
** Summary:
So it runs in the 3M constant space I'd expected the original program to run in,
but its a fair bit slower. The generated code looks pretty good though.
Investigate why this is slower.
------------------------------------------------------------------------
Program 4:
==============================
hs/space-xxxxx-acc-1.hs:
{-# LANGUAGE BangPatterns #-}
cnt :: Int -> String -> Int
cnt !acc bs = if null bs
then acc
else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)
main = do s <- getContents
print (cnt 0 s)
** Prediction:
Lazy, so constant space, but probably 10x slower than the previous program.
** I'm not going to bother with this one, since strings suck for large data.
------------------------------------------------------------------------
Program 5:
hs/space-bs-c8-foldlx-1.hs:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as B
cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
main = do s <- B.getContents
print (cnt s)
Ok, a strict foldl'.
** Prediction:
Should be similar to the strict bytestring first example, and a little
faster due to no redundant null checks.
** Result
$ time ./D < 150M
24569024
./D < 150M 1.02s user 0.58s system 111% cpu 1.436 tota
** Summary: As expected
------------------------------------------------------------------------
Program 6:
==============================
hs/space-bslc8-foldlx-1.hs:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as B
cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
main = do s <- B.getContents
print (cnt s)
A strict fold over a lazy bytestring.
Prediction: Ok, constant memory use, and should be similar in speed
to the first length. filter program. It'll be faster than manual
accumulating lazy bytestring program, since the redundant null checks
are gone from the inner loop.
Result:
$ time ./E < 150M
24569024
./E < 150M 0.84s user 0.07s system 98% cpu 0.928 total
2468k memory.
Wow. Didn't expect it to be that fast. So properly lazy, and nicely fast.
This is what we'd hope to see :) It also re-enforces that there's a bug in the
length . filter program.
------------------------------------------------------------------------
Program 7:
==============================
hs/space-xxxxx-foldl.hs:
{-# LANGUAGE BangPatterns #-}
cnt :: String -> Int
cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
main = do s <- getContents
print (cnt s)
Hmm. Lazy accumulator eh, on String? Should exhibit a space leak.
Not worth running..
**********************
Summary,
* Program 1 is fast, as expected, but exhbits a bug in the bytestring
library's lazy bytestring fusion system. Something in length or filter
isn't doing the right job. This code will be replaced by the stream fusion
system soon.
* Program 2: as expected. strict IO uses O(N) space, and that has performance
effects.
* Program 3: lazy bytestrings use constant space, but you better avoid
redundant bounds checks in the inner loops.
* Program 4: strings are silly
* Program 5: as expected. similar to program 2.
* Program 6: strict foldl's over lazy bytestrings are good :)
fast, and constant space.
* Program 7: see program 4.
Pretty much as expected then, but with a bug identified in lazy bytestring fusion (I think).
Nice little benchmark.
-- Don
More information about the Haskell-Cafe
mailing list