Space leak due to pairs

Shin-Cheng Mu scm at ipl.t.u-tokyo.ac.jp
Sat May 20 00:18:00 EDT 2006


Hi,

I am trying to fix a space leak which I suspected to be
caused by an old problem involving lazy pattern matching
for pairs. P. Wadler and J. Sparud each suggested fixes
to the compiler to resolve the space leak. I was wondering
whether they are implemented in GHC. In the GHC Users
mailing list archive I found a message in 2004 from Simon Peyton
Jones saying that GHC does have the fix, but its effects
are fragile when mixed with other optimisations:

   http://www.haskell.org/pipermail/glasgow-haskell-users/2004-August/ 
007023.html

I am wondering what the situation is now, and whether there
is a way to make the fix work for my code.

      *    *    *

The problematic code looks like:

   idX (StartEvent a : strm) =
       let (ts, strm') = idX strm
           (us, strm'') = idX strm'
       in (StartEvent a : ts ++ EndEvent a : us, strm'')

The program processes a stream of XML events, and in this case
it simply returns the stream itself with a minimal amount of
validation. The intention of the code was to thread "strm"
through the recursive calls and free the items in strm as soon
as they are processed. So the program should use constant space
in the heap (and stack space proportional to the depth of the
XML tree). However, with this definition I needed about 8Mb of
heap for a 1Mb input file.

The situation is the same even if I do not use -O or -O2.
So it seems that the GC fix did not work for this case.
Is there a way to make it work?

If I do have to alter the code to avoid the space leak,
what is the best way to do it?

      *    *    *

Without the GC fix there are still ways to alter the code to
plug the space leak. The result is quite ugly, therefore
I'd wish that the compiler could do it for me.

I will present the code below. They may serve as evidence that
the space leak does come from the lazy pattern problem. Also,
if there is a better way to code it, please let me know.

Since I suspected that delayed evaluation of us
and strm'' keeps the pointer to strm around for too long, the
natural solution seems to be forcing their evaluation.

So I tried:

   idX (StartEvent a : strm) =
       case idX strm of
          (ts, strm') ->
              case idX strm' of
                (us, strm'') ->
                  (StartEvent a : ts ++ EndEvent a : us, strm'')

which made things worse. The heap size increased a bit more
and I get a typical "bell" shaped heap profile which seems to
suggest that idX cannot output anything before processing the
whole input.

So I have to allow idX to output something while consuming
the input. What eventually worked was this messy mixture
of let and case:

   idX (StartEvent a : strm) =
     let (xs, rest) =
          case idX strm of
            (ts, strm') ->
               let (ws, rest) =
                     case idX strm' of
                       (us, strm'') -> (us, strm'')
               in (ts ++ EndEvent a : ws, rest)
     in (StartEvent a : xs, rest)

The heap residency reduced from about 8MB to around 80Kb.

This is rather tricky, however. The following "nicer looking"
version has a heap profile as bad as the worse case.

    idX (StartEvent a : strm) =
      let (ts,us,strm'') =
          case idXsp strm of
            (ts, strm') ->
                case idXsp strm' of
                  (us, strm'') -> (ts, us, strm'')
      in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

And I cannot quite explain why...

    *   *   *

If you'd like to download the code to do some profiling,
here it is:

     http://www.psdlab.org/~scm/hx.tar.gz

It includes and uses HXML, a library by Joe to
do the parsing. The main program is hxpc.hs. There is a
1 MB sized sample input file size1m.xml. There are two
simple scripts "recompile" and "runhxpc" for compiling
and running the program.

sincerely,
Shin-Cheng Mu






More information about the Glasgow-haskell-users mailing list