[Haskell-cafe] space leaks and optimizations

Alexei Kitaev kitaev at iqi.caltech.edu
Fri Jan 8 06:04:06 EST 2010


Dear Haskellers,

Recently I have been looking for a programming language that would be
suitable for small scientific and recreational projects and palatable to
a picky person like me. (I do theoretical physics and some math; I do
not program very often.)  Haskell and Clean look attractive from a
mathematician's point of view and allow for very elegant solutions in
some cases. I tried Clean first because it has a more efficient
implementation and better array support. However, I am leaning toward
Haskell since it has more libraries, a larger community, and some nice
features (e.g., the "do" notation).

I thought it would be easier to avoid errors when programming in a pure
functional language. Indeed, the Haskell type system is great and
catches many errors. But in practice, I found it very difficult to write
correct programs in Haskell! This is because my definition of
correctness includes asymptotic complexity (time and space). I have
pondered why Haskell is so prone to space leaks and whether this can be
fixed. I posted a related message (describing a space leak caused by
inlining) to Glasgow-haskell-users,
http://www.haskell.org/pipermail/glasgow-haskell-users/2009-November/018063.html
but apparently the GHC developers were busy preparing a new release.
Perhaps on Haskell-cafe there are more people with some spare time; I
would really appreciate your comments.

So, there seem to be several reasons why controlling space usage is
difficult:

1) The operational semantics of Haskell is not specified. That said, it
seems that unoptimized programs behave in a predictable way if you
follow the execution step by step. The Clean report explicitly says that
the execution model is graph reduction; I believe that Haskell uses the
same model. However, there are some subtleties, e.g., the tail call and
selector optimizations. (I read about the
selector optimization in Wadler's paper,
http://homepages.inf.ed.ac.uk/wadler/topics/garbage-collection.html
and saw it mentioned on the GHC development page. It's really nice and
indispensable, but it seems to be missing from the user documentation.
Is the following description accurate? After the rhs of a lazy
pattern like
(a,b) = expression
has been evaluated, the pattern does not take up any space, and the
space occupied by a and b can be reclaimed independently.) There must be
more subtleties. I imagine that a rigorous definition of the operational
semantics would be too complicated to be practical. But maybe an
informal specification is better than nothing? Why has no one attempted
to write one?
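
To make the selector optimization concrete, here is a small sketch of my
own (the names are made up) of the kind of binding it applies to:

    sumsOfHalves :: [Int] -> (Int, Int)
    sumsOfHalves xs = (sum front, sum back) where
        -- Without the selector optimization, the implicit "fst of the
        -- pair" and "snd of the pair" thunks each retain the whole
        -- pair, so the first half of the list stays live until the
        -- second sum is also demanded. With it, each half becomes
        -- reclaimable independently.
        (front, back) = splitAt (length xs `div` 2) xs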

2) While each step is predictable, the overall behavior of a lazy
program can be rather surprising. So one must be very careful. GHC
provides two ways to control the evaluation order, seq and bang
patterns, but I am not sure which of these (if any) is the right tool.
Consider the following example (from the Real World Haskell book):
    import Data.List (foldl')

    mean :: [Double] -> Double
    mean xs = sum / fromIntegral num where
        (num, sum) = foldl' f (0,0) xs :: (Int, Double)
        -- foldl' forces each intermediate pair only to WHNF, not its
        -- components, so the n+1 and s+x thunks still pile up.
        f (n,s) x = (n+1, s+x)
Although it uses foldl', there is a space leak because Haskell tuples
are not strict. There are two possible fixes:
    f (!n,!s) x = (n+1, s+x)
or
    f (n,s) x = let n1=n+1; s1=s+x in n1 `seq` s1 `seq` (n1,s1)
The first one is short, but less natural than the second one, where the
offending thunks are suppressed on the spot. Unfortunately, the syntax
is awkward; it would be nice to write something like
    f (n,s) x = (!n+1, !s+x)
Well, perhaps I am becoming too grumpy and the bang patterns are fine. A
more important question: are there established practices to *avoid*
space leaks rather than fixing them afterwards? (One idiom I can think
of is sketched just below, but I do not know whether it counts as
established.)
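
Here is that sketch (my own; the Acc type is made up): the strictness is
declared once, in a dedicated accumulator type, rather than with bangs
at every use site.

    import Data.List (foldl')

    data Acc = Acc !Int !Double

    mean :: [Double] -> Double
    mean xs = s / fromIntegral n where
        Acc n s = foldl' step (Acc 0 0) xs
        -- foldl' forces each intermediate Acc to WHNF, and the strict
        -- fields then force the count and the running sum, so no
        -- thunk chain builds up.
        step (Acc n' s') x = Acc (n' + 1) (s' + x)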

3) The standard library was not designed with space efficiency in mind;
for example, there is sum but no sum', so one has to define the strict
version by hand, as in the sketch below.
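
For instance (my sketch):

    import Data.List (foldl')

    -- The strict sum that the Prelude lacks: foldl' keeps the
    -- accumulator evaluated at every step.
    sum' :: Num a => [a] -> a
    sum' = foldl' (+) 0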

4) GHC does a pretty good job optimizing programs for speed, but
compiling with the -O option can easily introduce a space leak. I have
not encountered such problems with Clean, though my experience is very
limited. Apparently, the Clean developers have disallowed certain unsafe
optimizations; see, e.g., this message:
http://mailman.science.ru.nl/pipermail/clean-list/2009/004140.html
I doubt that the Clean solution is 100% reliable because the compiler
still uses strictness analysis, which can change the asymptotic space
usage (usually for the better, but sometimes for the worse; a sketch of
the common "for the better" case follows below). So I wonder whether it
would be feasible to identify a set of conservative optimizations that
change the space and time usage by at most a constant factor. For
example, strictness analysis could be limited to fixed-size data. Or,
instead of strictness analysis, one could inline the top-level patterns
of every function. Of course, that would make the optimization less
effective on average, but predictability is more important.
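
To illustrate the "for the better" direction, here is a standard sort of
example (my own sketch, not taken from the Clean thread):

    -- Compiled without optimization, go builds a linear chain of
    -- (acc + x) thunks; with -O, GHC's strictness analysis normally
    -- infers that go is strict in acc, and the loop runs in constant
    -- space. So the asymptotic space usage depends on the optimizer.
    leakySum :: Int -> Int
    leakySum n = go 0 [1 .. n] where
        go acc []     = acc
        go acc (x:xs) = go (acc + x) xs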


Best regards,
Alexei


