[Haskell-cafe] Conduit+GHC high memory use for simple Sink
Bryan Vicknair
bryanvick at gmail.com
Thu Aug 28 05:58:26 UTC 2014
Thanks for the interesting blog posts Michael. I updated the example project
[1] to use conduit 1.2. Unfortunately, on my machine [2], my original sink2
still uses about 500Mb of memory when processing 4 gzip files of about 5Mb
each, while sink1 only uses about 8Mb. I added sink3, which does the same as
sink2 but uses fold from Conduit.List as you recommended, and that seems to
work, using about 8Mb.
Looking at the code for sink2 vs sink3, I don't understand what would be
occupying so much memory in sink2 even in the case of expensive monadic
binding, or exclusion from stream fusion. I'm curious if sink2 adds thunks to
the heap that sink3 doesn't, or if the GC is failing to clean up heap objects
in sink2 that is cleans up in sink3. I'm new at memory profiling, but the
chart I get with '+RTS -h' or '+RTS -hr' basically just tells me that the
action function is expensive.
In the real project that inspired this example I'm going to do some cleanup,
replacing manual recursion with higher-level functions from Conduit.List, as
that seems like an all around good idea.
Bryan Vicknair
[1] https://bitbucket.org/bryanvick/conduit-mem
[2] GHC 7.8.3, Arch Linux 3.16.1 kernel x86-64
On Thu, Aug 28, 2014 at 07:00:41AM +0300, Michael Snoyman wrote:
<snip>
> But looking at the code again with fresher eyes than last night: I really
> don't understand why it had such abysmal performance. I'll look into this a
> bit more, looks like it should be interesting.
>
>
> On Thu, Aug 28, 2014 at 1:39 AM, Dan Burton <danburton.email at gmail.com>
> wrote:
>
> > Michael, I don't see how your code sample for (3) is any different to the
> > compiler than Roman's original sink2.
> >
> > I also don't see how the original sink2 creates a bad bind tree. I presume
> > that the reason "fold" works is due to the streaming optimization rule, and
> > not due to its implementation, which looks almost identical to (3).
> >
> > I worry about using fold in this case, which is only strict up to WHNF,
> > and therefore wouldn't necessarily force the integers in the tuples;
> > instead it would create tons of integer thunks, wouldn't it? Roman's
> > hand-coded sink2 avoids this issue so I presume that's not what is causing
> > his memory woes.
> >
> > -- Dan Burton
> >
> >
> > On Wed, Aug 27, 2014 at 2:55 PM, Roman Cheplyaka <roma at ro-che.info> wrote:
> >
> >> * Michael Snoyman <michael at snoyman.com> [2014-08-27 23:48:06+0300]
> >> > > The problem is the following Sink, which counts how many even/odd
> >> Tokens
> >> > > are
> >> > > seen:
> >> > >
> >> > > type SinkState = (Integer, Integer)
> >> > >
> >> > > sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
> >> > > sink2 state@(!evenCount, !oddCount) = do
> >> > > maybeToken <- await
> >> > > case maybeToken of
> >> > > Nothing -> return state
> >> > > (Just Even) -> sink2 (evenCount + 1, oddCount )
> >> > > (Just Odd ) -> sink2 (evenCount , oddCount + 1)
> >> >
> >> > Wow, talk about timing! What you've run into here is expensive monadic
> >> > bindings. As it turns out, this is exactly what my blog post from last
> >> > week[1] covered. You have three options to fix this:
> >> >
> >> > 1. Just upgrade to conduit 1.2.0, which I released a few hours ago, and
> >> > uses the codensity transform to avoid the problem. (I just tested your
> >> > code; you get constant memory usage under conduit 1.2.0, seemingly
> >> without
> >> > any code change necessary.)
More information about the Haskell-Cafe
mailing list