[Haskell-cafe] Conduit+GHC high memory use for simple Sink

Michael Snoyman michael at snoyman.com
Thu Aug 28 08:49:45 UTC 2014


On Thu, Aug 28, 2014 at 11:37 AM, Simon Peyton Jones <simonpj at microsoft.com>
wrote:

>  GHC is keeping the entire representation of `lengthM` in memory
>
>
>
> Do you mean that?  lengthM is a function; its representation is just code.
>
>
>

At the time I wrote it, I did. What I was seeing in the earlier profiling
was that a large number of conduit constructors were being kept in memory,
and I initially thought something similar was happening with lengthM. It
*does* in fact seem like the memory problems with this later example are
simply the list being kept in memory. And in fact, there's a far simpler
version of this that demonstrates the problem:

main :: IO ()
main = printLen >> printLen

printLen :: IO ()
printLen = lengthM 0 [1..40000000 :: Int] >>= print

lengthM :: Monad m => Int -> [a] -> m Int
lengthM cnt [] = return cnt
lengthM cnt (_:xs) =
    cnt' `seq` lengthM cnt' xs
  where
    cnt' = cnt + 1

I'll add that as a comment to #7206.

This still doesn't answer what's going on in the original code. I'm
concerned that the issue may be the same, but I'm not seeing anything in
the core yet that's jumping out at me as being the problem. I'll try to
look at the code again with fresher eyes later today.

Michael


>  Perhaps you mean that GHC is keeping the entire list [1..1000000] in
> memory?  Now that certainly makes sense… after all, doing so saves
> allocating (I# 4), (I# 5) etc for each call of the function passed to
> awaitForever.  Granted, it’s probably a bad idea in this case.
>
>
>
> If that is your issue (still to be confirmed) the relevant ticket is
> https://ghc.haskell.org/trac/ghc/ticket/7206; could you add your example
> to that ticket, as further evidence that something should be done?
>
>
>
> See also comment:9 in the ticket, which I have just added.
>
>
>
> Simon
>
>
>
>
>
> *From:* Haskell-Cafe [mailto:haskell-cafe-bounces at haskell.org] *On Behalf
> Of *Michael Snoyman
> *Sent:* 28 August 2014 06:57
> *To:* Bryan Vicknair
> *Cc:* Haskell Cafe
> *Subject:* Re: [Haskell-cafe] Conduit+GHC high memory use for simple Sink
>
>
>
> I actually just got to an interesting result: sink2 is a red herring.
> Consider the following program:
>
> import Control.Monad.IO.Class ( liftIO )
> import Data.Conduit.Internal (ConduitM (..), Pipe (..), (>+>), runPipe,
> awaitForever)
>
> main :: IO ()
> main = runPipe $
>     (HaveOutput (Done ()) (return ()) ()) >+>
>     awaitForever (\_ -> liftIO $ lengthM 0 [1..10000000 :: Int] >>= print)
>
> lengthM :: Monad m => Int -> [a] -> m Int
> lengthM cnt [] = return cnt
> lengthM cnt (_:xs) =
>     cnt' `seq` lengthM cnt' xs
>   where
>     cnt' = cnt + 1
>
>   On my machine, it takes 375MB of memory. What appears to be the cause
> is that GHC is keeping the entire representation of `lengthM` in memory,
> which is clearly a pessimization. I still need to research this further,
> but I thought you'd want to see these results now. (Plus, maybe someone
> else has some other ideas.)
>
> In case anyone wants, the core for this code is available at:
>
> http://lpaste.net/110125
>
>
>
> Michael
>
>
>
> On Thu, Aug 28, 2014 at 8:58 AM, Bryan Vicknair <bryanvick at gmail.com>
> wrote:
>
>  Thanks for the interesting blog posts Michael.  I updated the example
> project
> [1] to use conduit 1.2.  Unfortunately, on my machine [2], my original
> sink2
> still uses about 500Mb of memory when processing 4 gzip files of about 5Mb
> each, while sink1 only uses about 8Mb.  I added sink3, which does the same
> as
> sink2 but uses fold from Conduit.List as you recommended, and that seems to
> work, using about 8Mb.
>
> Looking at the code for sink2 vs sink3, I don't understand what would be
> occupying so much memory in sink2 even in the case of expensive monadic
> binding, or exclusion from stream fusion.  I'm curious if sink2 adds
> thunks to
> the heap that sink3 doesn't, or if the GC is failing to clean up heap
> objects
> in sink2 that is cleans up in sink3.  I'm new at memory profiling, but the
> chart I get with '+RTS -h' or '+RTS -hr' basically just tells me that the
> action function is expensive.
>
> In the real project that inspired this example I'm going to do some
> cleanup,
> replacing manual recursion with higher-level functions from Conduit.List,
> as
> that seems like an all around good idea.
>
>
> Bryan Vicknair
>
> [1] https://bitbucket.org/bryanvick/conduit-mem
> [2] GHC 7.8.3, Arch Linux 3.16.1 kernel x86-64
>
>
> On Thu, Aug 28, 2014 at 07:00:41AM +0300, Michael Snoyman wrote:
> <snip>
>
> > But looking at the code again with fresher eyes than last night: I really
> > don't understand why it had such abysmal performance. I'll look into
> this a
> > bit more, looks like it should be interesting.
> >
> >
> > On Thu, Aug 28, 2014 at 1:39 AM, Dan Burton <danburton.email at gmail.com>
> > wrote:
> >
> > > Michael, I don't see how your code sample for (3) is any different to
> the
> > > compiler than Roman's original sink2.
> > >
> > > I also don't see how the original sink2 creates a bad bind tree. I
> presume
> > > that the reason "fold" works is due to the streaming optimization
> rule, and
> > > not due to its implementation, which looks almost identical to (3).
> > >
> > > I worry about using fold in this case, which is only strict up to WHNF,
> > > and therefore wouldn't necessarily force the integers in the tuples;
> > > instead it would create tons of integer thunks, wouldn't it? Roman's
> > > hand-coded sink2 avoids this issue so I presume that's not what is
> causing
> > > his memory woes.
> > >
> > > -- Dan Burton
> > >
> > >
> > > On Wed, Aug 27, 2014 at 2:55 PM, Roman Cheplyaka <roma at ro-che.info>
> wrote:
> > >
> > >> * Michael Snoyman <michael at snoyman.com> [2014-08-27 23:48:06+0300]
> > >> > > The problem is the following Sink, which counts how many even/odd
> > >> Tokens
> > >> > > are
> > >> > > seen:
> > >> > >
> > >> > >   type SinkState = (Integer, Integer)
> > >> > >
> > >> > >   sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
> > >> > >   sink2 state@(!evenCount, !oddCount) = do
> > >> > >     maybeToken <- await
> > >> > >     case maybeToken of
> > >> > >       Nothing     -> return state
> > >> > >       (Just Even) -> sink2 (evenCount + 1, oddCount    )
> > >> > >       (Just Odd ) -> sink2 (evenCount    , oddCount + 1)
> > >> >
> > >> > Wow, talk about timing! What you've run into here is expensive
> monadic
> > >> > bindings. As it turns out, this is exactly what my blog post from
> last
> > >> > week[1] covered. You have three options to fix this:
> > >> >
> > >> > 1. Just upgrade to conduit 1.2.0, which I released a few hours ago,
> and
> > >> > uses the codensity transform to avoid the problem. (I just tested
> your
> > >> > code; you get constant memory usage under conduit 1.2.0, seemingly
> > >> without
> > >> > any code change necessary.)
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20140828/47b48a40/attachment.html>


More information about the Haskell-Cafe mailing list