[Haskell-cafe] Byte Histogram
Andrew Coppin
andrewcoppin at btinternet.com
Thu Feb 3 22:10:51 CET 2011
There now follows a small grab-bag of miscellaneous but related thoughts.
(Which probably means that this post will spawn a 2000 message thread
discussing one tiny side-issue, rather than the main thrust of the
message...)
First of all, benchmarking. We have The Great Language Shootout, which
tests how fast solutions implemented in various programming languages
can be, given unbounded amounts of time, energy, expertise, and compiler
modifications. Which is interesting, from a certain point of view.
In another sense, it's less interesting. For example, a C program can be
as fast as any Haskell program, since Haskell can always be compiled by
transforming it *into* C. (Unless you're seriously going to suggest that
GHC's native code generator is any match for the might of a half-decent
C compiler...)
That got me thinking about a more interesting question: Not *how fast*
can it go, but *how easily* can it go fast? People have written fast C
programs and fast Haskell programs, but how easy is it to write a
"typical" program and have it go fast, without spending years optimising it?
With that in mind, I set off with a plan to do some small-scale
benchmarking. Pick a problem, solve it in both Haskell and C++, in the
"simplest, most obvious way", and then apply "the simplest, most obvious
optimisations". Measure performance and compare.
There are benchmarks that just output some data. ("Find all the prime
numbers less than 1000...") I dislike these for a couple of reasons, not
least because you can precompute the correct answer and just write a
program which outputs that. (!) So I decided that all my benchmark
programs should take a variable-sized data file as input, and save the
results as output.
There's an old maxim of running benchmarks multiple times, to ensure
that whatever time you got wasn't just a fluke. But then, maybe the
first run pulls the input file into the file cache, and subsequent runs
go faster. If your input data was randomly generated, perhaps you chose
a particularly optimal or pessimal data set by accident. So I decided
that each test run should use a different input file with approximately
the same characteristics (size, random distribution, etc.)
So I ended up picking benchmarks like "sort the lines of this file into
ascending order", "count the number of unique words in this file",
"produce a histogram of byte values", "compute some statistics of this
list of numbers", etc. The tasks are all extremely simple, so that there
is some hope of also being able to implement them in C++.
One benchmark turned out to be particularly interesting: I call it "byte
histogram". The task is simple:
- Open a binary input file.
- Read a stream of bytes from it.
- Count how many times each of the 256 possible byte values appears.
The test inputs are binary files of various sizes, some with a uniform
distribution, some with variously skewed distributions (so that some
bytes have a vastly higher count than others).
Assuming we have some suitable import statements in place (say,
Data.ByteString.Lazy qualified as BS and Data.Map qualified as MAP), it's
quite easy to do this:

bytes <- BS.readFile "Input.bin"
let out = BS.foldl' (\ map byte -> MAP.insertWith (+) byte 1 map)
                    MAP.empty bytes
writeFile "Output.csv" (unlines $ map (\(k,v) -> show k ++ "," ++ show v)
                                $ MAP.toAscList out)
(There is some slight trickiness if you want bytes with zero frequency
to still be listed.)
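For what it's worth, one way to handle that (a sketch only) is to pre-seed
a map with every possible byte at zero, and take the left-biased union with
the real counts:

import Data.Word (Word8)
import qualified Data.Map as MAP

-- Every possible byte, each with a count of zero.
allZero :: MAP.Map Word8 Int
allZero = MAP.fromList [ (b, 0) | b <- [minBound .. maxBound] ]

-- Data.Map's union is left-biased, so real counts win over the zeros.
withZeroes :: MAP.Map Word8 Int -> MAP.Map Word8 Int
withZeroes counts = MAP.union counts allZero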
All of this *works* perfectly - i.e., it produces the correct answers.
It's also astronomically slow, and it's very easy to make it eat many
gigabytes of RAM while it runs. (If the program starts actually
swapping, then you will truly know what "slow" is!)
OK, so what happens if we replace Data.Map with Data.IntMap? Well, it
goes slightly faster, and consumes slightly less RAM. Inputs which
didn't quite complete before will run to completion now. But performance
is still abysmal.
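For reference, the IntMap variant is a near-mechanical substitution (again,
just a sketch); the only wrinkle is converting the Word8 key to an Int:

import qualified Data.ByteString.Lazy as BS
import qualified Data.IntMap as IM

-- Same fold as before, but accumulating into an IntMap keyed by byte value.
countBytes :: BS.ByteString -> IM.IntMap Int
countBytes = BS.foldl' step IM.empty
  where
    step m byte = IM.insertWith (+) (fromIntegral byte) 1 m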
Enough with this tree silliness. What happens if you use an array? Given
that the entire program (apart from the last 0.02% of it) spends its
time notionally mutating the data, a mutable array would be the logical
choice. Dial up an IOArray Word8 Int and the structure of the program
changes slightly, but it's still only a handful of lines of code. And
the performance? Still glacial.
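Roughly, the shape of that version is this (a sketch only, with the inc
helper as defined just below):

import qualified Data.ByteString.Lazy as BS
import Data.Array.IO
import Data.Word (Word8)

main :: IO ()
main = do
   bytes <- BS.readFile "Input.bin"
   -- One mutable cell per possible byte value, all starting at zero.
   counts <- newArray (minBound, maxBound) 0 :: IO (IOArray Word8 Int)
   -- Walk the lazily unpacked byte stream, bumping the matching cell.
   mapM_ (inc counts) (BS.unpack bytes)
   pairs <- getAssocs counts
   writeFile "Output.csv" (unlines [ show k ++ "," ++ show v | (k, v) <- pairs ])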
In particular, suppose we have
inc :: IOArray Word8 Int -> Word8 -> IO ()
inc array byte = do
   count <- readArray array byte
   let count' = count + 1
   writeArray array byte count'
in the main loop. The program is giving us the right answer, but using
absurd amounts of time and space to get it. Now watch this:
inc :: IOArray Word8 Int -> Word8 -> IO ()
inc array byte = do
   count <- readArray array byte
   let count' = count + 1
   count' `seq` writeArray array byte count'
And now, suddenly, memory usage becomes constant, regardless of input
size, and run-time is slashed. The program goes from taking 50 seconds
to process a certain file to taking only 0.02 seconds. And from taking
400 MB of RAM to taking... less RAM than I can actually measure properly.
If we now replace the IOArray with an IOUArray then the seq call becomes
superfluous, and there's a small performance increase, but nothing major.
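For completeness, the unboxed version of inc is literally the same code with
a different type (sketch):

import Data.Array.IO
import Data.Word (Word8)

-- IOUArray stores unboxed Ints, so writeArray has to evaluate the new
-- count anyway, and the explicit seq becomes unnecessary.
inc :: IOUArray Word8 Int -> Word8 -> IO ()
inc array byte = do
   count <- readArray array byte
   writeArray array byte (count + 1)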
None of this should be any surprise to anybody who actually knows how
Haskell works. For anybody not yet expert enough to see the problem: The
"inc" function doesn't read the array, see that a given byte has a count
of 5, and overwrite that with a 6. It sees a 5 and overwrites it with a
5+1. By the end of the main loop, every single array cell contains
1+(1+(1+(1+(1+(1+...))))), which takes a stackload of RAM to hold.
That explains why RAM usage is so absurd. On top of all that, creating
all this data takes time, and the spiralling RAM usage makes the GC go
crazy trying to find garbage to get rid of (when of course there is
none). And finally, at the end, you have to unwind all these
expressions. (Curiously, I haven't seen a stack overflow yet...)
Adding the seq call forces 5+1 to actually turn into 6, not eventually
but *right now*, and so each array cell ends up actually containing a
fully-evaluated integer. (No doubt this change in strictness also
triggers an avalanche of optimisation opportunities, but the important
thing is that we're no longer uselessly delaying evaluation.)
As I say, none of this is a surprise to anybody who actually knows how
to get performance out of Haskell. The strict version is actually faster
than the C++ version. (I imagine I'm doing something horribly wrong with
C++, of course...) Not drastically faster, but faster. Which probably
just means that ByteString is doing nicer buffering than whatever the
C++ standard library is called.
The important observation is this: One tiny, almost insignificant change
can transform a program from taking 50 seconds and 400 MB of RAM into
one that takes 0.02 seconds and 0.1 MB of RAM. And, at least in this
case, the simpler version is the slow one.
To say that Haskell is "slow" is both a little bit vague, and not really
backed up by facts. In about 5 minutes flat, I managed to write a
Haskell program that's very simple, and yet faster than a comparably
simple C++ program (and C++ is supposed to be "fast"). So it's not that
Haskell is "slow". It's that Haskell is *tricky*. Tiny, tiny little
changes that look innocuous can have vast effects on performance. And
this is a nice little example of that effect.
As we (probably?) all know, strictness is a subtle and fickle thing.
Consider:
or :: Bool -> Bool -> Bool
or True True = True
or True False = True
or False True = True
or False False = False
Overlooking the obvious name clash, this seems like a perfectly
legitimate definition for the logical-OR function. But now try this:
xs `contains` x = foldr or False $ map (x ==) xs
A nice, pretty Haskell definition. But oh dears - this does not have the
performance you would expect at all! But if you change the definition
above to the apparently equivalent
or True _ = True
or False x = x
then "contains" becomes much faster. If you saw these two functions side
by side, you might well wonder what the hell the difference is. You
might even wonder if the compiler would transform one to the other
(although I'm not sure in which direction). But there is, in fact, a
very, very big difference: one is lazier than the other.
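To make the difference concrete, here is a little sketch with the two
versions renamed (orStrict and orLazy) to dodge the Prelude clash:

-- orStrict pattern-matches on both arguments, so it always forces its
-- second argument; orLazy can answer True after seeing only the first.
orStrict :: Bool -> Bool -> Bool
orStrict True  True  = True
orStrict True  False = True
orStrict False True  = True
orStrict False False = False

orLazy :: Bool -> Bool -> Bool
orLazy True  _ = True
orLazy False x = x

containsWith :: Eq a => (Bool -> Bool -> Bool) -> [a] -> a -> Bool
containsWith orF xs x = foldr orF False $ map (x ==) xs

-- containsWith orLazy   [1 ..] 5   returns True as soon as the 5 is found;
-- containsWith orStrict [1 ..] 5   never returns, because every (==) result
--                                  in the infinite list has to be inspected.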
For counting bytes, lazy = bad. For searching for matches, lazy = good.
Similarly, if you take the byte histogram program and replace the lazy
ByteString with a strict one, RAM usage spikes sharply.
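That change, too, is just a one-line edit: the strict module slurps the
whole file into memory before the fold even starts, while the lazy one
streams it in chunks (sketch):

import qualified Data.ByteString.Lazy as BS   -- streams the input in chunks
--import qualified Data.ByteString as BS      -- reads the entire file up front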
In all, laziness is a tricky, tricky little hobbitses. I wrote a C++
program, in the obvious way, and it was very, very fast. I wrote a
Haskell program in several comparably obvious ways, and they were all
cripplingly slow. Until I added the magic "seq", and suddenly got
blinding speed. So while it is not correct to say "Haskell is slow", one
could justifiably say "C++ is fast more reliably than Haskell".
Consider for a moment the original implementation with Data.Map. Adding
a seq or two here will do no good at all; seq only evaluates to WHNF. What
we want is NF, and I can see no way at all of achieving that. (Other
than doing something very expensive like looking up every possible key,
which is far more operations than actually necessary.)
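In other words, the obvious thing to try only forces the outermost node of
the map, leaving the thunks inside it untouched. A sketch of the failure
mode, not a fix:

import qualified Data.Map as MAP
import Data.Word (Word8)

step :: MAP.Map Word8 Int -> Word8 -> MAP.Map Word8 Int
step m byte =
   let m' = MAP.insertWith (+) byte 1 m
   in m' `seq` m'   -- WHNF only: the node exists, but the stored value
                    -- is still an unevaluated (+) thunk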
I wrote my own little BST algorithm, customised to the fact that there
are always exactly 256 keys of type Word8. It was also nauseatingly
slow. And then I did something: I inserted a few strictness annotations
on the constructor fields. And (as you already worked out), performance
increased drastically.
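The annotations in question are just bangs on the constructor fields. A
simplified sketch of the sort of tree I mean (not the exact 256-key version):

import Data.Word (Word8)

-- Strict fields: building a Node forces the stored count (and the
-- subtrees), so no chain of (+) thunks can pile up behind a key.
data CountTree
   = Tip
   | Node !Word8 !Int !CountTree !CountTree

bump :: Word8 -> CountTree -> CountTree
bump b Tip = Node b 1 Tip Tip
bump b (Node k n l r)
   | b < k     = Node k n (bump b l) r
   | b > k     = Node k n l (bump b r)
   | otherwise = Node k (n + 1) l r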
I can do that if *I* define the data structure. But if it's in a
library, I am powerless to make it more strict (or less, if that were
what I wanted).
That got me thinking... What would happen if, instead of "Integer", we
had two types, "evaluated Integer" and "possibly unevaluated Integer"?
What if the strictness or otherwise of a data structure were exposed at
the type level?
The general idea here is that anything declared as having type
"evaluated Integer" can never contain an unevaluated expression. It must
always contain an actual integer. This of course implies that you might
be able to unbox it or do other interesting things with it, but it also
says something about evaluation.
The principal question is "what happens if you assign an unevaluated
Integer to an evaluated Integer?" Is that legal? Do you have to
explicitly evaluate it first? Or does that happen implicitly? Presumably
"evaluated Integer" means evaluated to WHNF only - which for an integer
is the same thing as NF, but for some more complex structure it might
not be.
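A quick illustration of that distinction, as a sketch:

-- For a flat type like Integer, WHNF and NF coincide: once you can see the
-- number, there is nothing left unevaluated. For a structured type they differ:
pair :: (Integer, Integer)
pair = (1 + 1, 2 + 2)
-- 'pair' in WHNF is just the (,) constructor; both components may still
-- be unevaluated thunks until something demands them.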
If the type system worked like this, I could say "Data.Map with evaluated
integer keys and possibly unevaluated integer values", and it would hold
its values lazily. Or I could say "Data.Map with evaluated integer keys
and evaluated integer values", and the structure would now magically be
strict. Which is to say, the type system would ensure that any data
passed into the structure was evaluated first, and the compiler could do
any applicable optimisations based on the assumption of data being
already evaluated. (For primitive types, that might be unboxing or
passing in registers. For ADTs, it might just mean skipping an
evaluation check.)
Currently, if you want a strict list, you have to implement one
yourself. But is that strict in the spine, or the elements, or what? You
have to implement every combination that you want. And, let us not
forget, then go implement all the functions in Data.List over your new
data structure. Not fun.
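For instance, a spine-and-element-strict list is easy enough to write down
(a sketch), but none of Data.List will work on it:

-- Strict in both the spine and the elements; a spine-strict-only variant
-- would drop the bang on the element, and so on for every combination.
data StrictList a = Nil | Cons !a !(StrictList a)

infixr 5 `Cons`

sLength :: StrictList a -> Int
sLength Nil         = 0
sLength (Cons _ xs) = 1 + sLength xs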
If, on the other hand, strictness were in the type system, data
structures would be *parameterised* over the level of strictness required.
I have no idea what the syntax for that would look like, and backwards
compatibility looks like a nightmare. You would need to answer questions
like "what does it mean for a function to return an evaluated type?" and
"is evaluation implicit or explicit?" But it seems like a rather
interesting idea.
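The closest thing I can see in today's Haskell is a wrapper with a strict
field, which is a crude approximation at best; a sketch, with made-up names:

import qualified Data.Map as MAP
import Data.Word (Word8)

-- Once an Evaluated value is itself forced, its contents cannot be a thunk.
data Evaluated a = Evaluated !a

type LazyCounts   = MAP.Map Word8 Int              -- values may be thunks
type StrictCounts = MAP.Map Word8 (Evaluated Int)  -- values forced when wrapped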
(Let us not forget: *Everything* is trivial for the person who doesn't
have to write the implementation...)