[Haskell-cafe] Re: Lazy Parsing

Sun May 31 07:41:44 EDT 2009

> It is my pleasure to announce that after 5 days of experimenting  
> with uu-parsinglib I have absolutely no clue, whatsoever, on how to  
> use it.
>
> I do not even manage to write a parser for even a mere digit or a  
> simple character.

I don't know whether you will be willing to change over to polyparse  
library, but here are some hints about how you might use it.

Given that you want the input to be a simple character stream, rather  
than use a more elaborate lexer, the first thing to do is to  
specialise the parser type for your purposes:

 > type TextParser a = Parser Char a

Now, to recognise a "mere digit",

 > digit :: TextParser Char
 > digit = satisfy Char.isDigit

and for a sequence of digits forming an unsigned integer:

 > integer :: TextParser Integer
 > integer = do ds <- many1 digit
 >              return (foldl1 (\n d-> n*10+d)
 >                             (map (fromIntegral.digitToInt) ds))
 >           `adjustErr` (++("expected one or more digits"))

> I mean I'd like to be able to turn "12.05.2009" into something like  
> (12, 5, 2009) and got no clue what the code would have to look like.  
> I do know almost every variation what the code must not look like :).

 > date = do a <- integer
 >           satisfy (=='.')
 >           b <- integer
 >           satisfy (=='.')
 >           c <- integer
 >           return (a,b,c)

Of course, that is just the standard (strict) monadic interface used  
by many combinator libraries.  Your original desire was for lazy  
parsing, and to achieve that, you must move over to the applicative  
interface.  The key difference is that you cannot name intermediate  
values, but must construct larger values directly from smaller ones by  
something like function application.

 > lazydate = return (,,) `apply` integer `discard` dot
 >                        `apply` integer `discard` dot
 >                        `apply` integer
 >    where dot = satisfy (=='.')

The (,,) is the constructor function for triples.  The `discard`  
combinator ensures that its second argument parses OK, but throws away  
its result, keeping only the result of its first argument.

Apart from lazy space behaviour, the main observable difference  
between "date" and "lazydate" is when errors are reported on incorrect  
input.  For instance:

   > fst $ runParser date "12.05..2009"
   *** Exception: In a sequence:
   Parse.satisfy: failed
   expected one or more digits

   > fst $ runParser lazydate "12.05..2009"
   (12,5,*** Exception: In a sequence:
   Parse.satisfy: failed
   expected one or more digits

Notice how the lazy parser managed to build the first two elements of  
the triple, whilst the strict parser gave no value at all.

I know that the error messages shown here are not entirely  
satisfactory, but they can be improved significantly just by making  
greater use of the `adjustErr` combinator in lots more places (it is  
rather like Parsec's <?>).  Errors containing positional information  
about the input can be constructed by introducing a separate lexical  
tokenizer, which is also not difficult.

Regards,
     Malcolm