[Haskell-cafe] Re: Haskell-Cafe Digest, Vol 78, Issue 14

Thu Feb 11 14:27:27 EST 2010

On Thu, 2010-02-11 at 11:00 -0500, Gregory Collins wrote:
> Maciej Piechotka <uzytkownik2 at gmail.com> writes:
> 
> > On Tue, 2010-02-09 at 16:41 +0000, John Lato wrote:
> >> 
> >> See http://inmachina.net/~jwlato/haskell/ParsecIteratee.hs for a valid
> >> Stream instance using iteratee.  Also Gregory Collins recently posted
> >> an iteratee wrapper for Attoparsec to haskell-cafe.  To my knowledge
> >> these are not yet in any packages, but hackage is vast. 
> >
> > Hmm. Am I correct that his implementation caches everything? 
> 
> The one that John posted (iteratees on top of parsec) has to keep a copy
> of the entire input, because parsec wants to be able to do arbitrary
> backtracking on the stream.
> 

Well. Not quite. AFAIU (and ByteString implementation indicate so) the
uncons have a type
    uncons :: s -> m (Maybe (t, s))

Where s indicates the position on the stream. Since it is impossible to
get back from having s alone the GC should be free to finalize all
memory allocated to cache the stream before the first living s.

I.e. if input is:

text = 'L':'o':'r':'e':'m':' ':'i':'p':'s':'u':'m':[]
                    ^               ^
                    s1              s2

and s1 and s2 are position in the stream (for stream that is list)
GC can free Lor part. It seems that it might be significant in real live
as try calls are relatively short comparing with rest of code.

By keeping s as 'pointer to' element second uncons have O(1) time
instead of O(n).

> 
> > I tried to rewrite the implementation using... well imperative linked
> > list. For trivial benchmark it have large improvement (althought it may
> > be due to error in test such as using ByteString) and, I believe, that
> > it allows to free memory before finish.
> >
> > Results of test on Core 2 Duo 2.8 GHz:
> > 10:	0.000455s	0.000181s
> > 100:	0.000669s	0.001104s
> > 1000:	0.005209s	0.023704s
> > 10000:	0.053292s	1.423443s
> > 100000:	0.508093s	132.208597s
> 
> Which column corresponds to which module here, and which module are you
> benchmarking against, John's or mine?
> 
> G

As I'm implementing for parsec (I don't know attoparsec) [as a kind of
exercise to get iteratee better] I benchmarked against John's. My
results are on the left.

I forgot to compile and optimize. Here's result for ByteString:
        Mine            John's
10:	0.000425s	0.000215s
100:	0.000616s	0.001963s
1000:	0.0041s		0.048359s
10000:	0.041694s	4.492774s
100000:	0.309289s	434.238449s

And []:
        Mine            John's
10:	0.000605s	0.000932s
100:	0.001464s	0.008101s
1000:	0.004036s	0.054125s
10000:	0.032341s	1.36938s
100000:	0.317859s	115.846891s

Regards and sorry for confusion 

PS. Sorry - I know that test is somehow simplistic but I thing it
emulates real live situation where you are interested in small amount of
elements around current position (short try). I also think
that /dev/zero have relatively predictable access time (it does not need
to be loaded from slow disk first after which it can be accessed from
cache, it does not run out of entropy etc.).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20100211/0fa77b47/attachment.bin