[Haskell-cafe] iteratee-compress space leak?

Mon Feb 21 14:22:14 CET 2011

Hi Maciej,

Thanks for looking into this.

> From: Maciej Piechotka <uzytkownik2 at gmail.com>
>
> On Fri, 2011-02-18 at 17:27 +0300, Michael A Baikov wrote:
> > I am trying to play with iteratee, making a parser for squid log files, but
> > found that my code does not run in constant space when it tries to process
> > compressed log files. So I simplified my code down to this snippet:
> >
> > import Data.ByteString (ByteString)
> > import Data.Iteratee as I
> > import Data.Iteratee.Char
> > import Data.Iteratee.ZLib
> > import System
> >
> > main = do
> >         args <- getArgs
> >         let fname     = args !! 0
> >         let blockSize = read $ args !! 1
> >
> >         fileDriver (leak blockSize) fname >>= print
> >
> > leak :: Int -> Iteratee ByteString IO ()
> > leak blockSize = joinIM $ enumInflate GZip defaultDecompressParams chunkedRead
> >     where
> >         consChunk :: Iteratee ByteString IO String
> >         consChunk = (joinI $ I.take blockSize I.length) >>= return . show
> >
> >         chunkedRead :: Iteratee ByteString IO ()
> >         chunkedRead = joinI $ convStream consChunk printLines
> >
> >
> > First argument - file name (/var/log/messages.1.gz will do)
> > second - size of block to consume input. With a small block size (10 bytes)
> > it leaks very fast; with larger blocks (~10000) it works almost without
> > leaks.
> >
> > So, is it a bug in my code, or should iteratee-compress behave
> > differently?
>
> After looking into the problem (or rather at your code), I believe the
> problem has nothing to do with iteratee-compress. I get similar behaviour
> and results when I replace "joinIM $ enumInflate GZip
> defaultDecompressParams chunkedRead" with chunkedRead. (The memory use is
> smaller, but that is due to decompression, not an iteratee fault.)
>

This is due to "printLines".  Whether it's a bug depends on what the correct
behavior of "printLines" should be.

"printLines" currently only prints lines that are terminated by an EOL
(either "\n" or "\r\n").  This means that it needs to hold on to the entire
stream received until it finds an EOL, and then prints the stream, or drops it
if it reaches EOF first.  In your case, the stream generated by "convStream
consChunk printLines" is just a stream of numbers without any EOL, and its
length depends on the specified block size: the smaller the block, the more
numbers are produced and the faster the buffer grows.  This causes the space
leak.
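
To make the buffering concrete, here is a rough pure model of a
line-buffered consumer.  It is only a sketch, not the code in
Data.Iteratee.Char: the names "LineState", "feedChunk", "feedAll" and
"pendingSize" are made up for illustration, and it only treats "\n" as EOL.

```haskell
-- Hypothetical pure model of a line-buffered printer.
-- This is NOT the iteratee library's implementation; all names are illustrative.
import Data.List (foldl')

-- | Text held back because no EOL has been seen yet, plus the
-- complete lines emitted so far (most recent first).
data LineState = LineState
  { pending :: String
  , emitted :: [String]
  } deriving (Show, Eq)

emptyState :: LineState
emptyState = LineState "" []

-- | Feed one chunk. Every complete line is emitted; any trailing text
-- without a '\n' stays in the pending buffer.
feedChunk :: LineState -> String -> LineState
feedChunk (LineState buf out) chunk = go (buf ++ chunk) out
  where
    go s acc = case break (== '\n') s of
      (line, '\n' : rest) -> go rest (line : acc)
      (partial, _)        -> LineState partial acc

-- | Feed a whole sequence of chunks, starting from an empty buffer.
feedAll :: [String] -> LineState
feedAll = foldl' feedChunk emptyState

-- | Number of characters still buffered after all chunks were fed.
pendingSize :: [String] -> Int
pendingSize = length . pending . feedAll
```

In this model, "pendingSize (replicate 1000 "10")" is 2000, because nothing
can ever be emitted, while "pendingSize (replicate 1000 "10\n")" is 0.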

If I changed the behavior of "printLines" to also print lines that aren't
terminated by EOL, the leak could be fixed.  Whether that behavior is more
useful than the present one, I don't know.  Alternatively, if you insert some
newlines into your stream, memory use should improve as well.
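
The two options can be compared on plain strings.  This is a sketch under my
own assumptions, not a patch to the library: "splitLines", "renderBare" and
"renderWithEOL" are hypothetical helpers, and only "\n" is treated as EOL.
The Bool flag selects between dropping an unterminated final line (what
"printLines" does today) and keeping it.

```haskell
-- Hypothetical sketch; these helpers are illustrative and not part of iteratee.

-- | Split text into lines. With True, a final line lacking a terminating
-- '\n' is kept; with False it is dropped, mirroring the current
-- behaviour described above.
splitLines :: Bool -> String -> [String]
splitLines keepPartial s = case break (== '\n') s of
  (line, '\n' : rest) -> line : splitLines keepPartial rest
  ("", _)             -> []
  (partial, _)        -> [partial | keepPartial]

-- | Render block lengths the way the snippet does: show, no separator.
renderBare :: [Int] -> String
renderBare = concatMap show

-- | Same numbers, but each one terminated by a newline.
renderWithEOL :: [Int] -> String
renderWithEOL = concatMap (\n -> show n ++ "\n")
```

For the block lengths [10, 10, 7], "splitLines False (renderBare ...)" yields
no lines at all, "splitLines True (renderBare ...)" yields the single line
"10107", and "splitLines False (renderWithEOL ...)" yields "10", "10", "7".
In the last case each line is complete as soon as its newline arrives, so
nothing has to be retained.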

As a result of investigating this, I realized that
Data.Iteratee.ListLike.break can be very inefficient in cases where the
predicate is not satisfied relatively early. I should actually provide an
enumeratee interface for it.  So thanks very much for (indirectly)
suggesting that.

Cheers,
John L