[Haskell-cafe] Conduit+GHC high memory use for simple Sink

Wed Aug 27 20:48:06 UTC 2014

On Wed, Aug 27, 2014 at 9:19 PM, Bryan Vicknair <bryanvick at gmail.com> wrote:

> Hello Cafe,
>
> First I'd like to thank Michael Snoyman and Gabriel Gonzalez for the work
> they've done on the conduit and pipes stream processing libraries, and all
> the
> accompanying tutorial content.  I've been having fun converting a text
> processing app from C to Haskell.
>
> I'm seeing unexpectedly high memory usage in a stream-processing program
> that
> uses the conduit library.
>
> I've created a example that demonstrates the problem.  The program accepts
> gzip
> files as arguments, and for each one, classifies each line as either Even
> or
> Odd depending on the length, then outputs some result depending on the Sink
> used.  For each gzip file:
>
>   action :: GzipFilePath -> IO ()
>   action (GzipFilePath filePath) = do
>     result <- runResourceT $  CB.sourceFile filePath
>                            $$ Zlib.ungzip
>                            =$ CB.lines
>                            =$ token
>                            =$ sink2
>     putStrLn $ show result
>
> The problem is the following Sink, which counts how many even/odd Tokens
> are
> seen:
>
>   type SinkState = (Integer, Integer)
>
>   sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
>   sink2 state@(!evenCount, !oddCount) = do
>     maybeToken <- await
>     case maybeToken of
>       Nothing     -> return state
>       (Just Even) -> sink2 (evenCount + 1, oddCount    )
>       (Just Odd ) -> sink2 (evenCount    , oddCount + 1)
>
> When I give this program a few gzip files, it uses hundreds of megabytes of
> resident memory.  When I give the same files as input, but use the
> following
> simple Sink, it only uses about 8Mb of resident memory:
>
>   sink1 :: MonadIO m => Sink Token m ()
>   sink1 = awaitForever (liftIO . putStrLn . show)
>
> At first I thought that sink2 performed so poorly because the addition
> thunks
> were being placed onto the heap until the end, so I added some bang
> patterns to
> make it strict.  That didn't help however.
>
> I've done profiling, but I'm unable to figure out exactly what is being
> added
> to the heap in sink2 but not sink1, or what is being garbage collected in
> sink1, but not sink2.
>
>
> The full source is here:
> https://bitbucket.org/bryanvick/conduit-mem/src/HEAD/hsrc/bin/mem.hs
>
> Or you can clone the repo, which contains a cabal file for easy building:
>
>   git clone git at bitbucket.org:bryanvick/conduit-mem.git
>   cd comduit-mem
>   cabal sandbox init
>   cabal install --only-dependencies
>   cabal build mem
>   ./dist/build/mem/mem [GIVE SOME GZIP FILES HERE]
>
> You can change which sink is used in the 'action' function to see the
> different
> memory usage. <http://www.haskell.org/mailman/listinfo/haskell-cafe>

Wow, talk about timing! What you've run into here is expensive monadic
bindings. As it turns out, this is exactly what my blog post from last
week[1] covered. You have three options to fix this:

1. Just upgrade to conduit 1.2.0, which I released a few hours ago, and
uses the codensity transform to avoid the problem. (I just tested your
code; you get constant memory usage under conduit 1.2.0, seemingly without
any code change necessary.)
2. Instead of writing your `sink2` as you have, express it in terms of
Data.Conduit.List.fold, which associates the right way. This looks like:

    fold add (0, 0)
      where
        add (!x, !y) Even = (x + 1, y)
        add (!x, !y) Odd = (x, y + 1)
3. Allow conduit 1.1's rewrite rules to emulate the same behavior and
bypass the expensive monadic bind. This can be done by replacing your
current `await` with "await followed by bind", e.g.:

sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
sink2 state@(!evenCount, !oddCount) = do
  await >>= maybe (return state) go
 where
  go Even = sink2 (evenCount + 1, oddCount    )
  go Odd  = sink2 (evenCount    , oddCount + 1)

I'd definitely recommend (1). I'd *also* recommend using (2), as the built
in functions will often times be more efficient than something hand-rolled,
especially now that stream fusion is being added[2]. With conduit 1.2, step
(3) *will* be a bit more efficient still (it avoids create an extra Maybe
value), but not in a significant way.

Michael

[1] https://www.fpcomplete.com/blog/2014/08/iap-speeding-up-conduit
[2] https://www.fpcomplete.com/blog/2014/08/conduit-stream-fusion
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20140827/bc0a1822/attachment.html>