[Haskell-cafe] Conduit+GHC high memory use for simple Sink
Michael Snoyman
michael at snoyman.com
Thu Aug 28 06:31:41 UTC 2014
Well, I got it down to a case that depends on only base, and uses its own
local implementation of a minimal conduit:
http://lpaste.net/110126
But I'm not certain if this is still reproducing the original issue, or if
the list getting lifted out here is a different issue.
On Thu, Aug 28, 2014 at 8:56 AM, Michael Snoyman <michael at snoyman.com>
wrote:
> I actually just got to an interesting result: sink2 is a red herring.
> Consider the following program:
>
> import Control.Monad.IO.Class ( liftIO )
> import Data.Conduit.Internal (ConduitM (..), Pipe (..), (>+>), runPipe,
> awaitForever)
>
> main :: IO ()
> main = runPipe $
> (HaveOutput (Done ()) (return ()) ()) >+>
> awaitForever (\_ -> liftIO $ lengthM 0 [1..10000000 :: Int] >>= print)
>
> lengthM :: Monad m => Int -> [a] -> m Int
> lengthM cnt [] = return cnt
> lengthM cnt (_:xs) =
> cnt' `seq` lengthM cnt' xs
> where
> cnt' = cnt + 1
>
>
> On my machine, it takes 375MB of memory. What appears to be the cause is
> that GHC is keeping the entire representation of `lengthM` in memory, which
> is clearly a pessimization. I still need to research this further, but I
> thought you'd want to see these results now. (Plus, maybe someone else has
> some other ideas.)
>
> In case anyone wants, the core for this code is available at:
>
> http://lpaste.net/110125
>
> Michael
>
>
> On Thu, Aug 28, 2014 at 8:58 AM, Bryan Vicknair <bryanvick at gmail.com>
> wrote:
>
>> Thanks for the interesting blog posts Michael. I updated the example
>> project
>> [1] to use conduit 1.2. Unfortunately, on my machine [2], my original
>> sink2
>> still uses about 500Mb of memory when processing 4 gzip files of about 5Mb
>> each, while sink1 only uses about 8Mb. I added sink3, which does the
>> same as
>> sink2 but uses fold from Conduit.List as you recommended, and that seems
>> to
>> work, using about 8Mb.
>>
>> Looking at the code for sink2 vs sink3, I don't understand what would be
>> occupying so much memory in sink2 even in the case of expensive monadic
>> binding, or exclusion from stream fusion. I'm curious if sink2 adds
>> thunks to
>> the heap that sink3 doesn't, or if the GC is failing to clean up heap
>> objects
>> in sink2 that is cleans up in sink3. I'm new at memory profiling, but the
>> chart I get with '+RTS -h' or '+RTS -hr' basically just tells me that the
>> action function is expensive.
>>
>> In the real project that inspired this example I'm going to do some
>> cleanup,
>> replacing manual recursion with higher-level functions from Conduit.List,
>> as
>> that seems like an all around good idea.
>>
>>
>> Bryan Vicknair
>>
>> [1] https://bitbucket.org/bryanvick/conduit-mem
>> [2] GHC 7.8.3, Arch Linux 3.16.1 kernel x86-64
>>
>>
>> On Thu, Aug 28, 2014 at 07:00:41AM +0300, Michael Snoyman wrote:
>> <snip>
>> > But looking at the code again with fresher eyes than last night: I
>> really
>> > don't understand why it had such abysmal performance. I'll look into
>> this a
>> > bit more, looks like it should be interesting.
>> >
>> >
>> > On Thu, Aug 28, 2014 at 1:39 AM, Dan Burton <danburton.email at gmail.com>
>> > wrote:
>> >
>> > > Michael, I don't see how your code sample for (3) is any different to
>> the
>> > > compiler than Roman's original sink2.
>> > >
>> > > I also don't see how the original sink2 creates a bad bind tree. I
>> presume
>> > > that the reason "fold" works is due to the streaming optimization
>> rule, and
>> > > not due to its implementation, which looks almost identical to (3).
>> > >
>> > > I worry about using fold in this case, which is only strict up to
>> WHNF,
>> > > and therefore wouldn't necessarily force the integers in the tuples;
>> > > instead it would create tons of integer thunks, wouldn't it? Roman's
>> > > hand-coded sink2 avoids this issue so I presume that's not what is
>> causing
>> > > his memory woes.
>> > >
>> > > -- Dan Burton
>> > >
>> > >
>> > > On Wed, Aug 27, 2014 at 2:55 PM, Roman Cheplyaka <roma at ro-che.info>
>> wrote:
>> > >
>> > >> * Michael Snoyman <michael at snoyman.com> [2014-08-27 23:48:06+0300]
>> > >> > > The problem is the following Sink, which counts how many even/odd
>> > >> Tokens
>> > >> > > are
>> > >> > > seen:
>> > >> > >
>> > >> > > type SinkState = (Integer, Integer)
>> > >> > >
>> > >> > > sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
>> > >> > > sink2 state@(!evenCount, !oddCount) = do
>> > >> > > maybeToken <- await
>> > >> > > case maybeToken of
>> > >> > > Nothing -> return state
>> > >> > > (Just Even) -> sink2 (evenCount + 1, oddCount )
>> > >> > > (Just Odd ) -> sink2 (evenCount , oddCount + 1)
>> > >> >
>> > >> > Wow, talk about timing! What you've run into here is expensive
>> monadic
>> > >> > bindings. As it turns out, this is exactly what my blog post from
>> last
>> > >> > week[1] covered. You have three options to fix this:
>> > >> >
>> > >> > 1. Just upgrade to conduit 1.2.0, which I released a few hours
>> ago, and
>> > >> > uses the codensity transform to avoid the problem. (I just tested
>> your
>> > >> > code; you get constant memory usage under conduit 1.2.0, seemingly
>> > >> without
>> > >> > any code change necessary.)
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20140828/f2304105/attachment.html>
More information about the Haskell-Cafe
mailing list