[Haskell-cafe] idea for avoiding temporaries

Jan-Willem Maessen Janwillem.Maessen at sun.com
Fri Mar 9 10:58:43 EST 2007


On Mar 8, 2007, at 12:00 PM, David Roundy wrote:

> Hi all,
>
> I was just teaching a class on minimization methods, with a focus on
> conjugate gradient minimization in particular, and one of my main  
> points
> was that with conjugate gradient minimization we only need three or  
> four
> arrays of size N (depending on whether we use the Fletcher-Reeves or
> Polak-Ribiere variant), ...  This got me thinking about one of the  
> largest problems
> with writing serious numerical code in Haskell, which is that of  
> memory
> consumption and avoiding temporary variables.

I've been following this discussion with interest, as I've been  
looking in some detail at conjugate gradient algorithms as part of my  
day job, and I've spent a good deal of time thinking about exactly  
the problems you raise.  For those following along at home, here's a  
sample somewhat-imperative CG algorithm (this is the simplified  
stripped-down version):

     for j <- seq(1#cgit) do
       q = A p
       alpha = rho / (p DOT q)
       z += alpha p
       rho0 = rho
       r -= alpha q
       rho := r DOT r
       beta = rho / rho0
       p := r + beta p
     end

Here p, q, r, and z are vectors, A is the derivative of our function
(in this case a sparse symmetric positive-definite matrix, but we can
really think of it as a higher-order function of type Vector->Vector),
and the Greek letters are scalars.  The "answer" is z.  In practice
we'd not run a fixed number of iterations, but instead do a  
convergence test.  All the hard work is in the line "q = A p", but  
the storage consumption is mostly in the surrounding code.  On a  
parallel machine (where these sorts of programs are often run) this
part of the algorithm has almost no parallelism---all those dot
products and normalizations act as synchronization points---but the
A p step and the dot products themselves are of course internally
quite parallelizable.
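
For concreteness, here is roughly what one trip through that loop body
looks like as ordinary, purely functional Haskell.  I'm using plain
lists as vectors just to keep the sketch self-contained (real code
would of course use unboxed arrays); every zipWith below allocates a
fresh vector, which is precisely the storage behaviour at issue:

    type Vector = [Double]

    dot :: Vector -> Vector -> Double
    dot xs ys = sum (zipWith (*) xs ys)

    -- One iteration of the loop body: take (z, r, p, rho), return the
    -- updated quadruple.  'applyA' plays the role of A.
    cgStep :: (Vector -> Vector)
           -> (Vector, Vector, Vector, Double)
           -> (Vector, Vector, Vector, Double)
    cgStep applyA (z, r, p, rho) = (z', r', p', rho')
      where
        q     = applyA p
        alpha = rho / (p `dot` q)
        z'    = zipWith (\zi pj -> zi + alpha * pj) z p
        r'    = zipWith (\ri qi -> ri - alpha * qi) r q
        rho'  = r' `dot` r'
        beta  = rho' / rho
        p'    = zipWith (\ri pj -> ri + beta * pj) r' p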

Sadly, many of the suggestions, while generally sound, just don't  
apply well to this particular use case.

* As others have noted, burying the updatable state in a monad just
begs the question.  The resulting code also looks nothing at all like
the mathematics, which is a big problem if you're trying to understand
and maintain it (the code above is virtually isomorphic to the problem
specification).  I'm sure David is seeking a version of the code that
is more declarative than the spec we were working from.  Note that
rather than embedding all the state in a special monad, we might as
well use update-in-place techniques (such as the += and -= operations
you see above) with the monads we've already got; the result will at
least be readable, but it will be too imperative.
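
To make that concrete, here is a minimal sketch of the kind of
monadic, update-in-place building block I mean, written against
Data.Array.ST (the representation and the helper's name are just
choices for the sketch):

    import Control.Monad.ST (ST)
    import Data.Array.ST (STUArray, getBounds, readArray, writeArray)

    -- In-place "y += a*x", the kernel behind the += and -= lines above.
    -- Perfectly readable, but the explicit reads and writes make it
    -- every bit as imperative as the original loop.
    axpyInPlace :: Double -> STUArray s Int Double -> STUArray s Int Double
                -> ST s ()
    axpyInPlace a x y = do
        (lo, hi) <- getBounds y
        let step i = do
              xi <- readArray x i
              yi <- readArray y i
              writeArray y i (yi + a * xi)
        mapM_ step [lo .. hi]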

* There are opportunities for loop fusion in CG algorithms---we can  
compute multiple dot products on the same array in a single loop--- 
but these have the flavor of fusing multiple consumers of a data  
structure into a single consumer, which I believe is still an  
unsolved problem in equational frameworks like foldr/build or  
streams.  It's a bit like fusing:

    n = foldl' (+) 0 . map (const 1) $ xs
    sum_xs = foldl' (+) 0 $ xs
    sum_sq = foldl' (+) 0 . map (\x->x*x) $ xs

into:

    (n,sum_xs,sum_sq) =
       foldl' (\(a0,b0,c0) (an,bn,cn)->(a0+an, b0+bn, c0+cn)) (0,0,0) .
       map (\x->(const 1 x, x, x*x)) $ xs

which we understand how to do, but not equationally (or at least we  
didn't last I looked, though Andy Gill and I had both fantasized  
about possibilities for doing so).
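
As an aside, the hand-fused traversal only actually runs in constant
space if we also keep the accumulator strict---something like the
following sketch, where the bang patterns are the extra ingredient:

    {-# LANGUAGE BangPatterns #-}
    import Data.List (foldl')

    -- Single-pass count, sum, and sum of squares with a strict
    -- accumulator; the lazy-tuple version above would pile up thunks.
    stats :: [Double] -> (Int, Double, Double)
    stats = foldl' step (0, 0, 0)
      where
        step (!n, !s, !sq) x = (n + 1, s + x, sq + x * x)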

None of these fusion opportunities actually save space, which makes
the problem trickier still.

* The algorithm as written already tries to express minimal storage.   
The only question is, do +=, -=, and := update their left-hand side  
in place, or do we think of them (in the Haskell model of the  
universe) as fresh arrays, and the previous values as newly-created  
garbage?  My challenge to fellow Haskell hackers: find a way to  
express this such that it doesn't look so imperative.

* Even if we do a good job with += and -=, which is what David seems  
to be looking for, we still have those irritating := assignments--- 
we'd like to throw out the old p and reuse the space in the last  
line.  And we'd like to have one piece of storage to hold the q on  
each successive iteration.  So even if we solve David's problem, we  
still haven't matched the space requirements of the imperative code.

* DiffArrays are too expensive to be acceptable here.  It's not even  
a question of unboxing.  Let's say we keep the current array in fast,  
unboxed storage; this lets us read its elements using a single load.   
Each update still needs to retrieve the old data from the array and
add it to the old-versions lookup table; together these operations
are much more expensive than the actual update to the current version
of the array.  And we need to do this even though no old versions
exist!  We should be able to avoid this work entirely.  (And, if old  
versions do exist, a DiffArray is the pessimal representation for  
them given that we're updating the whole array).
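
To see what that costs, here is what a single "z += alpha p" looks
like when pushed through a DiffArray (a sketch against the
Data.Array.Diff interface; the helper name is mine):

    import Data.Array.Diff (DiffArray)
    import Data.Array.IArray ((!), (//), bounds, range)

    -- "z += alpha * p" via DiffArray.  Because the whole array is
    -- rewritten, every old element gets pushed into the old-versions
    -- table, whether or not anyone ever looks at the old array again.
    axpyDiff :: Double -> DiffArray Int Double -> DiffArray Int Double
             -> DiffArray Int Double
    axpyDiff a x y =
        y // [ (i, y ! i + a * x ! i) | i <- range (bounds y) ]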

* Linear or Uniqueness types are almost what we want.  I think Josef
Svenningson captured this best: uniqueness types *require* that the
*caller* of these routines make sure it is not sharing the data.  So
we need two copies of our linear algebra
library---one which takes unique arrays as arguments and updates in  
place, and one which takes non-unique arrays and allocates.  And we  
have to pick the right one based on context.  What we want, it seems  
to me, is one library and a compiler which can make informed judgments.

* We could imagine tracking uniqueness dynamically at run time, using  
something like reference counting for all our arrays.  But we need to  
do the reference counting precisely---this is pretty much the most  
inefficient way possible of tracking the storage, and doesn't play  
well at all with using efficient GC elsewhere in our programs.  The  
inefficiency might be worth it for arrays, but Haskell is polymorphic  
and in many cases we need to treat all our data the same way.

* Finally, I'll observe that we often want to use slightly different  
algorithms depending upon whether we're updating in place or  
computing into fresh storage.  Often copying the data and then  
updating it in place is not actually a good idea.

I'd love to hear if anyone has insights / pointers to related work on  
any of the issues above; I'm especially keen to learn if there's work  
I didn't know about on fusion of multiple traversals.  In my day job  
with Fortress we are looking at RULES-like approaches, but they
founder quickly because the kinds of problems David is trying to
solve are 90% of what our programmers want to do.

-Jan-Willem Maessen


