A fancier Get monad or two (a la binary and binary-strict)

Thu Jul 31 08:00:22 EDT 2008

Johan Tibell wrote:
> Hi Chris,
> 
> Thanks for providing feedback. It's much appreciated.
> 
> On Wed, Jul 30, 2008 at 11:34 PM, Chris Kuklewicz
> <haskell at list.mightyreason.com> wrote:
>> The differences:
>>  Your error handling is via Alternative, and if the first branch advances
>> (consumes input) then the second branch is not attempted.  The state can
>> only go forward (by 1 byte) or remain in place (there is no look-ahead).  If
>> reading past the end, then either (*) the new first byte works (*) the new
>> first fails and the Alternative is ready to try it.
> 
> This is by design. This is enough for parsing many network protocols
> like HTTP and avoids having to deal with backtracking and the
> potential space leaks and bookkeeping overhead that might come with
> it.

I agree that the design space of get and parsing monads is quite broad. That is 
why my small endeavor now has 3 monads: MyGetW with MonadCont, MyGet without 
MonadCont, and MyGetSimplified without MonadCont,Reader,Writer,State,Trans.

> 
>>  The management of saved state on the stack of pending operations is much
>> simpler with your commit-if-advance semantics and much more complicated with
>> my rollback semantics.  Oddly, it seems your committed operations do not
>> immediately release the pending handlers so they can be garbage collected.
>>  This same kind of issue motivated me to improve the implementation of
>> binary-strict's incremental get.
> 
> I'm not sure what you mean by releasing pending handlers. It's
> probably something I've not considered. Could you please elaborate?

Absolutely.  It does not affect the semantics of the computation, but it does 
affect the space performance.  Your Hyena.Parser has a state, a success 
continuation, and a failure continuation.  The state has a count of the number 
of bytes consumed.  The Alterantive <|> operation is the interesting one:

> data S = S S.ByteString Int64 Bool

> newtype Parser a = Parser
>     { unParser :: forall r.
>                   S -> (a -> S -> IResult r) -> (S -> IResult r) -> IResult r }

> instance Alternative Parser where
>     empty = Parser $ \s _ fail -> fail s
> 
>     p <|> p' = Parser $ \s@(S _ pos _) succ fail ->
>                unParser p s succ $ \s'@(S _ pos' _) ->
>                    if pos == pos'
>                    then unParser p' s' succ fail
>                    else fail s'

I will rewrite this for clarity, and define the function fail' as

>     p <|> p' = Parser $ \s@(S _ pos _) succ fail ->
>                  let fail' s'@(S _ pos' _) = if pos == pos'
>                                                then unParser p' s' succ fail
>                                                else fail s'
>                  in unParser p s succ fail'

It is now clear that <|> works by allocating fail' which hold a reference to the 
original pos and succ and fail.  Happily, it does not hold a reference to the 
original S.ByteString because your handlers have commit-if-advance semantics.

The commit-if-advance semantics come from (if pos == pos') which checks to see 
whether p has consumed any of the input.  In practice pos <= pos' will always 
hold, since p cannot have gone backwards.

When does fail' get released?  It gets released after p fails by calling it or 
after p and thus <|> succeeds.  But fail' reverts in behavior to fail as soon as 
p advances pos.  So Parser keeps fail' longer than needed.  For example: if p 
uses bytes then even though bytes advances for a while the fail' is not released 
until it hits an error or p finishes.

The core problem is the error handlers in Hyena.Parser are written in the form 
of nested scopes and this does not model their semantics.  They behave like the 
Bool in the state and thus 'fail' can be moved into 'S' ...  which I have done 
in the attached file.

The meat of the difference is in 'satisfy'.  When pos is incemented to (pos+1) 
the (Maybe fail) is set to Nothing which signifies that the top-level failure 
continuation 'failed' is in effect.  The 'fail' functions, which are allocated 
in <|> as before, are thus immediately released when the parser advances.

Whether this makes it better or not depends on how deep the nested stack <|> 
calls become.  A very deep stack had many fail' waiting in vain under the 
original Hyena.Parser and are released much much sooner under my version.

[ Also, I deleted IPartial. ]

Cheers,
   Chris K
-------------- next part --------------
{-# LANGUAGE Rank2Types #-}

-- modified by Chris Kuklewicz
--   * remove IResult
--   * move fail continuation to state S, inside Maybe [changes S to (S r)]
--   * clear fail to Nothing when advancing
--   * use new 'callFailed' to invoke failure continuation

------------------------------------------------------------------------
-- |
-- Module      :  Hyena.Parser
-- Copyright   :  (c) Johan Tibell 2008
-- License     :  BSD3-style (see LICENSE)
--
-- Maintainer  :  johan.tibell at gmail.com
-- Stability   :  experimental
-- Portability :  portable
--
-- An incremental LL(1) parser combinator library.
--
------------------------------------------------------------------------

module Hyena.Parser
    (
      -- * The Parser type
      Parser,
      Result(..),
      runParser,

      -- * Primitive parsers
      satisfy,
      byte,
      bytes,

      module Control.Applicative
    ) where

import Control.Applicative
import qualified Data.ByteString as S
import Data.Int (Int64)
import Data.Word (Word8)
import Prelude hiding (fail, succ)

import Text.Show.Functions
import Debug.Trace

-- ---------------------------------------------------------------------
-- The Parser type

-- | The parse state.
data S r = S
    {-# UNPACK #-} !S.ByteString
    {-# UNPACK #-} !Int64
    {-# UNPACK #-} !Bool
    {-# UNPACK #-} !(Maybe (S r -> Result r))  -- error hander moved to state
         deriving Show

setFail :: S r -> Maybe (S r -> Result r) -> S r
setFail (S a b c _) mfail = (S a b c mfail)

getFail :: S r -> Maybe (S r -> Result r)
getFail (S _ _ _ mfail) = mfail

clearFail :: S r -> S r
clearFail (S a b c _) = S a b c Nothing

-- | A parse either succeeds, fails or returns a suspension with which
-- the parsing can be resumed.
data Result a = Finished a S.ByteString
              -- ^ Parsing succeeded and produced a value of type
              -- @a at . The returned 'S.ByteString' is the remaining
              -- unconsumed input.
              | Failed Int64
              -- ^ Parsing failed at the given position. Either
              -- because the parser didn't match the input or because
              -- an unexpected end of input was reached during
              -- parsing.
              | Partial (Maybe S.ByteString -> Result a)
              -- ^ The parsing needs more input to continue. Pass in
              -- @Just input@ to continue parsing and @Nothing@ to
              -- signal end of input. If @Nothing@ is passed the
              -- 'Result' is either 'Finished' or 'Failed'.
  deriving Show

-- | A parser takes a parse state, a success continuation and returns a 'Result'.
newtype Parser a = Parser
    { unParser :: forall r. S r -> (a -> S r -> Result r) -> Result r }

-- ---------------------------------------------------------------------
-- Instances

instance Functor Parser where
    fmap f p = Parser $ \s succ -> unParser p s (succ . f)
    {-# INLINE fmap #-}

instance Applicative Parser where
    pure a = Parser $ \s succ -> succ a s
    {-# INLINE pure #-}

    p <*> p' = Parser $ \s succ ->
                 let newSucc f s' = unParser p' s' (succ . f)
                 in unParser p s newSucc
    {-# INLINE (<*>) #-}

instance Alternative Parser where
    empty = Parser $ \s _ -> callFailed s
    {-# INLINE empty #-}

    p <|> p' = Parser $ \s@(S _ _ _ mfail) succ ->
                 let newS = setFail s (Just (\s' -> unParser p' (setFail s' mfail) succ))
                 in unParser p newS succ
    {-# INLINE (<|>) #-}

-- ---------------------------------------------------------------------
-- Running a parser

initState :: S.ByteString -> S r
initState bs = S bs 0 False Nothing
{-# INLINE initState #-}

-- | This is the final continuation that turns a successful parse into
-- a 'Result'.
finished :: r -> S r -> Result r
finished v (S bs _ _ _) = Finished v bs

-- | This is the final continuation that turns an unsuccessful parse
-- into a 'Result'.
failed :: S r -> Result a
failed (S _ pos _ _) = Failed pos

-- | TODO: Write documentation.
runParser :: Parser a -> S.ByteString -> Result a
runParser p bs = unParser p (initState bs) finished

callFailed :: S r -> Result r
callFailed s@(S _ _ _ Nothing) = failed s
callFailed s@(S _ _ _ (Just fail)) = fail s

-- ---------------------------------------------------------------------
-- Primitive parsers

-- | The parser @satisfy p@ succeeds for any byte for which the
-- supplied function @p@ returns 'True'.  Returns the byte that is
-- actually parsed.
satisfy :: (Word8 -> Bool) -> Parser Word8
satisfy p =
    Parser $ \s@(S bs pos eof mfail) succ ->
        case S.uncons bs of
          Just (b, bs') -> if p b
                             then succ b (S bs' (pos + 1) eof Nothing) -- clear (Maybe fail) on advance
                             else callFailed s
          Nothing       -> if eof
                           then callFailed s
                           else Partial $ \x ->
                               case x of
                                 Just bs' -> retry (S bs' pos eof mfail)
                                 Nothing  -> callFailed (S bs pos True mfail)
            where retry s' = unParser (satisfy p) s' succ

-- | @byte b@ parses a single byte @b at .  Returns the parsed byte
-- (i.e. @b@).
byte :: Word8 -> Parser Word8
byte b = satisfy (== b)

-- | @bytes bs@ parses a sequence of bytes @bs at .  Returns the parsed
-- bytes (i.e. @bs@).
bytes :: S.ByteString -> Parser S.ByteString
bytes = fmap S.pack . go
    where
      go bs = case S.uncons bs of
                Just (b, bs') -> liftA2 (:) (byte b) (go bs')
                Nothing       -> pure []