Lazy Parsing
Joe English
jenglish@flightlab.com
Thu, 28 Feb 2002 08:22:04 -0800
Brandon Michael Moore wrote:
> I'm wondering if there are any libraries out there for creating parsers
> that lazily build up their result. I know I could thread the remaining
> input through a parser by hand, but it seems like someone should have
> already done it.
This turns out to be rather difficult to do in the general case
(but see below -- XML is a special case).
If you have
type Parser sym result = [sym] -> Maybe (result, [sym])
a Parser can't decide whether to return 'Just (result,rest)'
or 'Nothing' until it has successfully parsed the complete result.
So pattern matching on the parser's return value will force
the entire production. Variations on the theme -- Either instead
of Maybe, list-of-successes, continuation-passing combinators, etc --
all face a similar problem.
However, if your top-level grammar is of the form:
things :: empty | thing things {- == thing* -}
then instead of:
case runParser (pMany pThing) input of Just (result,[]) -> ...
you can use something like
unfoldr (runParser pThing) input
to build the result list incrementally. This will be less eager;
instead of parsing and returning an entire list of Things, it
parses one Thing at a time.
Another thing to watch out for is heap drag. The list-of-successes
approach tends to retain the entire input, just in case the parser
needs to backtrack. Parsec [1] and UU_Parsing [?] solve this
by severely restricting the amount of required lookahead.
> I'd like to be able to turn a stream of XML into a lazy tree of tags
> (probably Maybe tags, or Either errors tags), but I don't think HaXml and
> the like do that sort of thing.
That's exactly how HXML [2] works. The parser returns a lazy
list of tokens (analogous to SAX events), which are folded up
into a tree by a separate function. In addition it uses a CPS
parser library so (as with Parsec), there is minimal heap drag.
[1] Parsec: <URL: http://www.cs.ruu.nl/~daan/parsec.html >
[1] HXML: <URL: http://www.flightlab.com/~joe/hxml >
(Note: HXML release 0.2 will be ready Real Soon Now, and there have been
many incompatible changes since 0.1. The main thing left to be finished
is the documentation, if you can live without that let me know and I'll
put a snapshot up.)
--Joe English
jenglish@flightlab.com