[Haskell-cafe] Re: Lazy Parsing

Sun May 31 08:38:00 EDT 2009

Dear Malcolm,

thanks for helping.

I had actually come to Haskell originally because of a parsing problem. 
I had been using Smalltalk until I started a project which required 
parsing files. Until then I had not done any RW parsing.

Well, the route was more like Parsec -> Haskell -> "wtf is Haskell?". 
Anyway, eventually I dropped Smalltalk and got addicted to Haskell, and 
managed to familiarize myself with Haskell and Parsec; the latter, as it 
turned out, I didn't even need to solve my original problem.

Anyway, polyparse certainly is an option, but despite my "list of 
failures" in using it, a few things give uu-parsinglib a special appeal: 
the breadth-first approach to choice, in particular, I find terribly 
elegant. Thanks to some kicks in my behind, it seems that I might be able 
to use Doaitse's combinators now; some more details on that are in 
another post.

Günther

Malcolm Wallace wrote:
>> It is my pleasure to announce that after 5 days of experimenting with 
>> uu-parsinglib I have absolutely no clue, whatsoever, on how to use it.
>>
>> I cannot even manage to write a parser for a mere digit or a 
>> simple character.
> 
> I don't know whether you will be willing to change over to the polyparse 
> library, but here are some hints about how you might use it.
> 
> Given that you want the input to be a simple character stream, rather 
> than use a more elaborate lexer, the first thing to do is to specialise 
> the parser type for your purposes:
> 
>  > type TextParser a = Parser Char a
> 
> Now, to recognise a "mere digit",
> 
>  > digit :: TextParser Char
>  > digit = satisfy Char.isDigit
> 
> and for a sequence of digits forming an unsigned integer:
> 
>  > integer :: TextParser Integer
>  > integer = do ds <- many1 digit
>  >              return (foldl1 (\n d-> n*10+d)
>  >                             (map (fromIntegral.digitToInt) ds))
>  >           `adjustErr` (++("expected one or more digits"))
> 
>> I mean I'd like to be able to turn "12.05.2009" into something like 
>> (12, 5, 2009) and got no clue what the code would have to look like. I 
>> do know almost every variation what the code must not look like :).
> 
>  > date = do a <- integer
>  >           satisfy (=='.')
>  >           b <- integer
>  >           satisfy (=='.')
>  >           c <- integer
>  >           return (a,b,c)
> 
> Of course, that is just the standard (strict) monadic interface used by 
> many combinator libraries.  Your original desire was for lazy parsing, 
> and to achieve that, you must move over to the applicative interface.  
> The key difference is that you cannot name intermediate values, but must 
> construct larger values directly from smaller ones by something like 
> function application.
> 
>  > lazydate = return (,,) `apply` integer `discard` dot
>  >                        `apply` integer `discard` dot
>  >                        `apply` integer
>  >    where dot = satisfy (=='.')
> 
> The (,,) is the constructor function for triples.  The `discard` 
> combinator ensures that its second argument parses OK, but throws away 
> its result, keeping only the result of its first argument.
> 
> Apart from lazy space behaviour, the main observable difference between 
> "date" and "lazydate" is when errors are reported on incorrect input.  
> For instance:
> 
>   > fst $ runParser date "12.05..2009"
>   *** Exception: In a sequence:
>   Parse.satisfy: failed
>   expected one or more digits
> 
>   > fst $ runParser lazydate "12.05..2009"
>   (12,5,*** Exception: In a sequence:
>   Parse.satisfy: failed
>   expected one or more digits
> 
> Notice how the lazy parser managed to build the first two elements of 
> the triple, whilst the strict parser gave no value at all.
> 
> I know that the error messages shown here are not entirely satisfactory, 
> but they can be improved significantly just by making greater use of the 
> `adjustErr` combinator in lots more places (it is rather like Parsec's 
> <?>).  Errors containing positional information about the input can be 
> constructed by introducing a separate lexical tokenizer, which is also 
> not difficult.
> 
> Regards,
>     Malcolm
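
To make the strict/lazy difference described in the quoted message 
concrete without depending on polyparse, here is a minimal sketch that 
uses only base. Every name in it (Result, Parser, succeed, andThen, 
applyLazy, discardLazy, char, lazyDate) is made up for illustration and 
is not polyparse's API; it only imitates the idea that monadic sequencing 
must check each step before continuing, whereas lazy application commits 
to a result as soon as the function part has parsed. It is an 
approximation of the technique, not a description of how polyparse is 
implemented.

```haskell
import Data.Char (digitToInt, isDigit)

infixl 3 `applyLazy`, `discardLazy`

-- Outcome of a parse: an error message, or a value plus the unconsumed input.
data Result a = Failure String | Success a String

-- Hypothetical toy parser type (not polyparse's Parser).
newtype Parser a = Parser { runP :: String -> Result a }

-- Lazy projections: a failure only surfaces when the value or rest is demanded.
valueOf :: Result a -> a
valueOf (Success x _) = x
valueOf (Failure e)   = error e

restOf :: Result a -> String
restOf (Success _ r) = r
restOf (Failure e)   = error e

-- A parser that consumes nothing and always succeeds.
succeed :: a -> Parser a
succeed x = Parser (\s -> Success x s)

-- Strict (monadic-style) sequencing: the first result is inspected
-- before the second parser is even chosen.
andThen :: Parser a -> (a -> Parser b) -> Parser b
andThen p f = Parser $ \s -> case runP p s of
  Failure e    -> Failure e
  Success x s' -> runP (f x) s'

-- Lazy application: once the function parser has succeeded, return Success
-- immediately; the argument parser runs only when its value is demanded.
applyLazy :: Parser (a -> b) -> Parser a -> Parser b
applyLazy pf px = Parser $ \s -> case runP pf s of
  Failure e    -> Failure e
  Success f s' -> let r = runP px s'
                  in  Success (f (valueOf r)) (restOf r)

-- Run the second parser for its effect on the input, keep the first value.
discardLazy :: Parser a -> Parser b -> Parser a
discardLazy px py = (succeed const `applyLazy` px) `applyLazy` py

-- Match one specific character.
char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x:rest) | x == c -> Success x rest
  _                 -> Failure ("expected " ++ show c)

-- One or more digits, read as an unsigned integer.
integer :: Parser Integer
integer = Parser $ \s -> case span isDigit s of
  ([], _)    -> Failure "expected one or more digits"
  (ds, rest) -> Success (foldl step 0 ds) rest
  where step n d = n * 10 + toInteger (digitToInt d)

-- Strict version: any failure anywhere makes the whole result a Failure.
date :: Parser (Integer, Integer, Integer)
date = integer  `andThen` \a ->
       char '.' `andThen` \_ ->
       integer  `andThen` \b ->
       char '.' `andThen` \_ ->
       integer  `andThen` \c ->
       succeed (a, b, c)

-- Lazy version: the triple exists at once; each field is parsed on demand.
lazyDate :: Parser (Integer, Integer, Integer)
lazyDate = succeed (,,) `applyLazy` integer `discardLazy` dot
                        `applyLazy` integer `discardLazy` dot
                        `applyLazy` integer
  where dot = char '.'
```

With these definitions, runP date "12.05..2009" is a Failure, whereas 
matching runP lazyDate "12.05..2009" against Success (a, b, _) _ yields 
a = 12 and b = 5; only demanding the third component raises the error. 
The price of the lazy version is that a Success no longer guarantees the 
whole input was well-formed.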