[Haskell-cafe] Parallel parsing & multicore

Edward Kmett ekmett at gmail.com
Wed Sep 9 19:31:00 EDT 2009


Actually you can still do the same trick, you just need a smarter delimiter
-- ironically the same one I use in Kata. ;)  Look for non-backslashed
newlines rather than newlines in general and start there.
You need a little bit of book-keeping in case a backslash and newline
straddle the chunk boundary but it bundles up and gets hidden from you by
the monoid.

As for the triple quoting or mixed mode languages, I have three options that
I've explored:

1.) I have a set of parser combinators that I've been working on that parse
fully monoidally, which are sufficient for context-free attribute grammars
which can tackle that problem head on. But you give up the monadic sugar
that makes parsec so syntactically sweet. (Though I need to change its name
since another project named parsimony just appeared on hackage!)

2.) You can define a reversible lexer, whereby given a list of tokens you
can generate the original input string, but then you spend a lot of time
gluing string fragments together when you started with a perfectly good
bytestring.

3.) My favorite option, using the iteratee backed parsec parser from the
post you can 'slice' the input bytestring between two cursor positions
efficiently. When you start parsing in a superposition of states and the
only marker flips your mode rather than cleanly starts or starts it,
especially when one is boring like a triple quoted string, just mark the
location of the start of your lexer, and maintain two token lists, flipping
their roles at each transition. All you wind up doing is grabbing a few
extra bytestring slices and lexing the body of your string using the general
purpose lexer just in case. Nicely if you have edge cases that cannot occur
in one sublanguage or the other, then you can use that to collapse out of
the superposition of states into a single state. I'm using this approach for
parsing CDATA sections in XML monoidally, which has the same issue.

-Edward Kmett


On Wed, Sep 9, 2009 at 3:58 PM, Anakim Border <akborder at gmail.com> wrote:

> Hi Edward,
>
> I read your slides a few weeks ago when they were posted on Reddit and
> found your approach very interesting. In fact it provided the
> inspiration for the parser I've written.
>
> I did not use the same strategy, however, because parsing makefiles
> poses its own challenges. The make syntax is actually the union of two
> different languages, one for defining targets and one for commands.
> It's easy to identify commands following target definitions (they
> belong to lines starting with a tab), but if you jump to the middle of
> a makefile you can't decide, for example, if you are inside a define
> block (that should be parsed like a command) because the lines in its
> body are not marked by any specific prefix. Thus the need for a
> sequential scan. All this may seem rather involved and specific to an
> unusual format, but one encounters the same problem when parsing
> here-documents in shell scripts or multi line (triple-quoted) comments
> in Python.
>
> The cache trashing effect you mention is certainly an issue I did not
> foresee. I guess a possible solution would be to base parMap on
> different 'par' primitive; one that sparks a number of computations
> equal to the number of available OS threads and then blocks until at
> least one of them is done.
>
> -- AB
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20090909/aebd119f/attachment.html


More information about the Haskell-Cafe mailing list