[Haskell-cafe] Re: Haskell-Cafe Digest, Vol 78, Issue 14

Fri Feb 12 07:51:13 EST 2010

On Fri, Feb 12, 2010 at 10:32 AM, Maciej Piechotka
<uzytkownik2 at gmail.com> wrote:
> On Thu, 2010-02-11 at 17:49 -0600, John Lato wrote:
>> On Thu, Feb 11, 2010 at 10:00 AM, Gregory Collins
>> <greg at gregorycollins.net> wrote:
>> > Maciej Piechotka <uzytkownik2 at gmail.com> writes:
>> >
>> >> On Tue, 2010-02-09 at 16:41 +0000, John Lato wrote:
>> >>>
>> >>> See http://inmachina.net/~jwlato/haskell/ParsecIteratee.hs for a valid
>> >>> Stream instance using iteratee.  Also Gregory Collins recently posted
>> >>> an iteratee wrapper for Attoparsec to haskell-cafe.  To my knowledge
>> >>> these are not yet in any packages, but hackage is vast.
>> >>
>> >> Hmm. Am I correct that his implementation caches everything?
>> >
>> > The one that John posted (iteratees on top of parsec) has to keep a copy
>> > of the entire input, because parsec wants to be able to do arbitrary
>> > backtracking on the stream.
>>
>> This is true, however I believe this alternative approach is also
>> correct.  The Cursor holds the stream state, and parsec holds on to
>> the Cursor for backtracking.  Data is only read within the Iteratee
>> monad when it goes beyond the currently available cursors, at which
>> point another cursor is added to the linked list (implemented with
>> IORef or other mutable reference).
>>
>> The downside to this approach is that data is consumed from the
>> iteratee stream for a partial parse, even if the parse fails.  I did
>> not want this behavior, so I chose a different approach.
>>
>
> Hmm. AFAIU your code you are doing something like:
>
>> concatCursor :: (Monad m, Reference r m, StreamChunk c el)
>>              => Cursor r m c el -> m (c el)
>> concatCursor c = liftM mconcat (concatCursor' c)
>>
>> concatCursor' :: (Monad m, Reference r m, StreamChunk c el)
>>               => Cursor r m c el -> m [c el]
>> concatCursor' (Cursor r v) =
>>   liftM2 (:) (return v) (readRef r >>= concatNextCursor')
>>
>> concatNextCursor' :: (Monad m, Reference r m, StreamChunk c el)
>>                  => NextCursor r m c el -> m [c el]
>> concatNextCursor' (NextCursor c) = concatCursor' $! c
>> concatNextCursor' _              = return $! []
>>
>> parsecIteratee' :: (Monad m, Reference r m, StreamChunk c el)
>>                 => ParsecT (Cursor r m c el) u (IterateeG c el m) a
>>                 -> u
>>                 -> SourceName
>>                 -> IterateeG c el m (Either ParseError a)
>> parsecIteratee' p u sn = do
>>    c <- lift $ mkCursor :: IterateeG c el m (Cursor r m c el)
>>    res <- runParserT (liftM2 (,) p getInput) u sn c
>>    case res of
>>      Right (a, c) -> do sc <- lift $ concatCursor c
>>                         liftI $! Done (Right a) $! Chunk $! sc
>>      Left   err   -> return $ Left err
>
> Which seems that it should work (I just checked if it is suppose to
> compile). Unfortunately I need to work the clash between transformers
> and mtl).
>
> EDIT. Ops. sorry. It will not work. However it will (as it should)
> return the remaining part back to stream.

Yes, the remaining part will be returned, but the consumed portion is
lost.  I couldn't figure out how to solve that problem other than
cacheing everything.

>
>> >
>> >> I tried to rewrite the implementation using... well imperative linked
>> >> list. For trivial benchmark it have large improvement (althought it may
>> >> be due to error in test such as using ByteString) and, I believe, that
>> >> it allows to free memory before finish.
>> >>
>> >> Results of test on Core 2 Duo 2.8 GHz:
>> >> 10:   0.000455s       0.000181s
>> >> 100:  0.000669s       0.001104s
>> >> 1000: 0.005209s       0.023704s
>> >> 10000:        0.053292s       1.423443s
>> >> 100000:       0.508093s       132.208597s
>> >
>
>> I expected poor performance of my code for larger numbers of elements,
>> as demonstrated here.
>>
>
> I haven't tested for more then 1e5 (which was in comment).

Interesting.  I expect good performance as long as chunks don't need
to be concatenated.  The default chunk size is either 4096 or 8192 (I
don't remember ATM).  This also assumes that no intervening functions
(take, drop, etc.) alter the stream too significantly.  Testing 1e5
wouldn't do more than two concats, and particularly with bytestrings
shouldn't impose too much penalty.  List performance would be much
worse though.

Incidentally, performance of the WrappedByteString newtype is poor
relative to true bytestrings.  This will be fixed in the next major
release (due in maybe a month or so?)

>
>> I envisioned the usage scenario where parsers would be relatively
>> short (<20 chars), and most of the work would be done directly with
>> iteratees.  In this case it would be more important to preserve the
>> stream state in the case of a failed parse, and the performance issues
>> of appending chunks wouldn't arise either.
>>
>
> Fortunately parsec does not limit the number of streams per monad so it
> is up to user which one he will choose (depending on problem).
>

Good point.

>
> Regards
> PS. Why iteratee uses transformers? It seems to be identical (both have
> functional dependencies etc.) to mtl except that mtl is standard in
> platform. Using both lead to clashes between names.

Short answer: I am using iteratee with another package that uses transformers.
Longer answer: see discussions on mtl vs. transformers in the
haskell-libraries archive.

There are a few simple solutions.  You can build iteratee against mtl
by changing the build-depends: field in iteratee.cabal.  You can also
use the LANGUAGE PackageImports pragma.  I'm unaware of any
difficulties with either of these approaches.

Cheers,
John