[Haskell-cafe] Conduit+GHC high memory use for simple Sink

Michael Snoyman michael at snoyman.com
Thu Aug 28 05:56:34 UTC 2014


I actually just got to an interesting result: sink2 is a red herring.
Consider the following program:

import Control.Monad.IO.Class ( liftIO )
import Data.Conduit.Internal (ConduitM (..), Pipe (..), (>+>), runPipe,
awaitForever)

main :: IO ()
main = runPipe $
    (HaveOutput (Done ()) (return ()) ()) >+>
    awaitForever (\_ -> liftIO $ lengthM 0 [1..10000000 :: Int] >>= print)

lengthM :: Monad m => Int -> [a] -> m Int
lengthM cnt [] = return cnt
lengthM cnt (_:xs) =
    cnt' `seq` lengthM cnt' xs
  where
    cnt' = cnt + 1


On my machine, it takes 375MB of memory. What appears to be the cause is
that GHC is keeping the entire representation of `lengthM` in memory, which
is clearly a pessimization. I still need to research this further, but I
thought you'd want to see these results now. (Plus, maybe someone else has
some other ideas.)

In case anyone wants, the core for this code is available at:

http://lpaste.net/110125

Michael


On Thu, Aug 28, 2014 at 8:58 AM, Bryan Vicknair <bryanvick at gmail.com> wrote:

> Thanks for the interesting blog posts Michael.  I updated the example
> project
> [1] to use conduit 1.2.  Unfortunately, on my machine [2], my original
> sink2
> still uses about 500Mb of memory when processing 4 gzip files of about 5Mb
> each, while sink1 only uses about 8Mb.  I added sink3, which does the same
> as
> sink2 but uses fold from Conduit.List as you recommended, and that seems to
> work, using about 8Mb.
>
> Looking at the code for sink2 vs sink3, I don't understand what would be
> occupying so much memory in sink2 even in the case of expensive monadic
> binding, or exclusion from stream fusion.  I'm curious if sink2 adds
> thunks to
> the heap that sink3 doesn't, or if the GC is failing to clean up heap
> objects
> in sink2 that is cleans up in sink3.  I'm new at memory profiling, but the
> chart I get with '+RTS -h' or '+RTS -hr' basically just tells me that the
> action function is expensive.
>
> In the real project that inspired this example I'm going to do some
> cleanup,
> replacing manual recursion with higher-level functions from Conduit.List,
> as
> that seems like an all around good idea.
>
>
> Bryan Vicknair
>
> [1] https://bitbucket.org/bryanvick/conduit-mem
> [2] GHC 7.8.3, Arch Linux 3.16.1 kernel x86-64
>
>
> On Thu, Aug 28, 2014 at 07:00:41AM +0300, Michael Snoyman wrote:
> <snip>
> > But looking at the code again with fresher eyes than last night: I really
> > don't understand why it had such abysmal performance. I'll look into
> this a
> > bit more, looks like it should be interesting.
> >
> >
> > On Thu, Aug 28, 2014 at 1:39 AM, Dan Burton <danburton.email at gmail.com>
> > wrote:
> >
> > > Michael, I don't see how your code sample for (3) is any different to
> the
> > > compiler than Roman's original sink2.
> > >
> > > I also don't see how the original sink2 creates a bad bind tree. I
> presume
> > > that the reason "fold" works is due to the streaming optimization
> rule, and
> > > not due to its implementation, which looks almost identical to (3).
> > >
> > > I worry about using fold in this case, which is only strict up to WHNF,
> > > and therefore wouldn't necessarily force the integers in the tuples;
> > > instead it would create tons of integer thunks, wouldn't it? Roman's
> > > hand-coded sink2 avoids this issue so I presume that's not what is
> causing
> > > his memory woes.
> > >
> > > -- Dan Burton
> > >
> > >
> > > On Wed, Aug 27, 2014 at 2:55 PM, Roman Cheplyaka <roma at ro-che.info>
> wrote:
> > >
> > >> * Michael Snoyman <michael at snoyman.com> [2014-08-27 23:48:06+0300]
> > >> > > The problem is the following Sink, which counts how many even/odd
> > >> Tokens
> > >> > > are
> > >> > > seen:
> > >> > >
> > >> > >   type SinkState = (Integer, Integer)
> > >> > >
> > >> > >   sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
> > >> > >   sink2 state@(!evenCount, !oddCount) = do
> > >> > >     maybeToken <- await
> > >> > >     case maybeToken of
> > >> > >       Nothing     -> return state
> > >> > >       (Just Even) -> sink2 (evenCount + 1, oddCount    )
> > >> > >       (Just Odd ) -> sink2 (evenCount    , oddCount + 1)
> > >> >
> > >> > Wow, talk about timing! What you've run into here is expensive
> monadic
> > >> > bindings. As it turns out, this is exactly what my blog post from
> last
> > >> > week[1] covered. You have three options to fix this:
> > >> >
> > >> > 1. Just upgrade to conduit 1.2.0, which I released a few hours ago,
> and
> > >> > uses the codensity transform to avoid the problem. (I just tested
> your
> > >> > code; you get constant memory usage under conduit 1.2.0, seemingly
> > >> without
> > >> > any code change necessary.)
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20140828/cac0600f/attachment.html>


More information about the Haskell-Cafe mailing list