[Haskell-cafe] Re: Brainstorming on how to parse IMAP

Tue Aug 5 16:49:24 EDT 2008

I am glad you asked Ben,

Short answer: It can return a Seq of your values. The values in the Seq are 
lazy, the Seq itself is finite.  It can return what it has so far before it 
finishes parsing (or even before the rest of the input has arrived from the 
network).

Ben Franksen wrote:
> ChrisK wrote:
>>    I recently posted new and fancy binary Get monads in
>> http://article.gmane.org/gmane.comp.lang.haskell.libraries/9691
>> and
>> http://article.gmane.org/gmane.comp.lang.haskell.libraries/9756
>> which might be of interest since network protocol are usually specified in
>> bytes at the wire level.
>>
>>    The latest one takes input which may or may not be complete and returns
>>    a
>> stream (a Seq) of results.
> 
> IIRC Seq is not a 'Stream' but a strict sequence? Or do you meant 'a stream
> (of Seq)'?

I meant it returns many (Seq y), one after the other, while doing parsing in 
between.

> 
> Cheers
> Ben

Long answer:

The complicated parser looks like this.  Start with the run function:

runCompGet :: (Monad m,Monoid w)
             => CompGet a y r w user m a
             -> r -> user -> L.ByteString
             -> m (CompResult y w user m a)

This takes a CompGet and a reader state r and a user state user and the 
(initial) input L.ByteString (Data.ByteString.Lazy.ByteString).  It evaluates to 
  the inner monad 'm' returning a CompResult.

The CompResult is a three-fold type:

> data CompResult y w user m a =
>     CFailed (Seq y) !Int64 String
>   | CFinished  (Seq y) !L.ByteString !Int64 w user a
>   | CPartial (Seq y) (Either ( m (CompResult y w user m a) )
>                                  ( Maybe L.ByteString -> m (CompResult y w user m a) ))

All three have (Seq y) which are the Data.Sequence.Seq of things which have been 
queued by "yieldItem".

CFailed also has the Int64 count of bytes parsed successfully and an error 
message String.  Nothing more can be done.

CFinished also has the unused tail of the input as a L.ByteString and an Int64
of the bytes consumed.  And the output of the writer w, the final user state, 
and lastly it has the end value returned by the computation which has type 'a'. 
  Nothing more can be done.

CPartial is the intermediate result.  It also carries Either:
  Left : the rest of the computation, currently suspended, to continue running.
  Right: a function from (Maybe ByteString) to the suspended computation.

The Left is a result of the "flushItems" command and is merely a way to return 
the (Seq y) so far before continuing.

The Right is a result of running out of input data.  This allows the program to 
feed more input into the parser which will be appended to all the previous 
input. One does this by passing (Just someByteString) to the function.  If the 
parser again runs out of data it will again return CPartial with a Right value.

Alternatively, one can pass Nothing.  This tells the parser that there will 
never ever be more input. The parser will never ask for (though it may 
flushItems and return a Right valued CPartial).

A key thing about the (Seq y) is that yielded items are only returned once.  The 
CPartial may be returned many times and each time it will have an empty list or 
fresh list of (Seq y).  The values in the Seq are lazy, the Seq itself is 
finite.  To collect all the value the caller has to concatenate all the (Seq 
y)'s that are returned during parsing.

As for parsing, the module offers the usual BinaryParser interface (package 
binary-strict) and has an interface which mostly overlaps Data.Binary.Get 
(package binary). For example: it has "getByteString" and "getWord64be" and 
"getStorable".

You don't have to use the "yieldItem" command.  You can just return the results 
in the final "a" return value (or the "user" state or the writer "w" value).  In 
this situation you only get an answer when CFinished is returned (and nothing if 
CFailed is returned).  I could not use the writer mechanism for yield-ing 
because the "listen" and "pass" parts of the MonadWriter class ensure it has the 
wrong semantics.

You might wonder:
  *) If the parser code uses MonadPlus to give several alternatives
  *) The first alternative gets more input via CPartial (perhaps several times)
  *) The first alternative then fails
  *) The second alternative starts parsing from the same position the first did
  *) Does the second alternative see the new input passed to CPartial earlier?
The answer is yes.  Changes to the input stream by appending with CPartial 
affect the whole computation and are never rolled back.

You might wonder:
   *) If the first alternative calls "yieldItem foo"
   *) The first alternative fails
   *) The second alternative calls flushItems
   *) Does the CPartial (Seq y) contain "foo"?
The answer is yes.  Items yielded are never rolled back.

[ Doing all of this in the presence of throwError/mzero/fail and lookAhead* and 
callCC was interesting to code. ]

Cheers,
   Chris