A new lexer/parser for c2hs [was: [C2hs] Re: support for 6.4]

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Mon May 30 14:18:41 EDT 2005


On Thu, 2005-05-26 at 16:46 +1000, Manuel M T Chakravarty wrote:
> Duncan Coutts:
> > On Wed, 2005-05-25 at 11:44 +1000, Manuel M T Chakravarty wrote:
> > 
> > > An Alex/Happy parser would be an option if it improves matters
> > > significantly.  If you or anybody else has a go at it, please follow the
> > > C language definition closely, as does the existing lexer and parser
> > > (with comments stating to which sections in K&R the individual
> > > productions relate).
> > 
> > Interim status update:

Another status update:

I started with a yacc C grammar and added the semantic actions by
translating from the existing parser. I reordered the clauses to match
the order in the original parser and kept the comments (but removed the
bits that are no longer true). I think I've added back in all the
syntactic extensions supported by c2hs. There is one shift/reduce
conflict, caused by the dangling-else ambiguity in the "if then else"
syntax, which I believe is benign.
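For reference, this is the classic dangling-else conflict. In Happy
notation it looks roughly like this (a sketch, not the actual c2hs
productions):

```
statement : IF '(' expr ')' statement
          | IF '(' expr ')' statement ELSE statement
```

With input like "if (a) if (b) s1 else s2" and ELSE as the lookahead
token, the parser can either reduce the first production or shift the
ELSE. Happy, like yacc, resolves the conflict by shifting, which binds
the else to the nearest if; that is the behaviour C requires, so the
conflict is harmless.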

It's not quite finished yet (in particular I've not done allocation of
attributes), but it does already parse the Gtk+ headers. It is quite
quick and, importantly, uses very little heap space. On my old 500MHz
SPARC, parsing the Gtk+ headers takes 8 seconds and requires 28MB of
heap. There appears to be essentially no allocation apart from that used
by the AST, since the parser does not get slower when the heap size is
very slightly larger than the minimum required. Depending on the heap
limit it either runs at full speed or runs out of heap.

Going back to the lexer, it now produces exactly the same output as the
original lexer (including positions and unique names). Sadly it seems to
have got quite a bit slower, for reasons I don't fully understand. In
particular, making it monadic (which we need to do because of the
typedef-name feedback from the parser, described below) seems to make it
rather slower. It is now taking 6 seconds rather than 2, and so is only
a little faster than the original lexer. On the positive side, if the
lexer accounts for 6 of the 8 seconds total, then the parser itself is
only taking 2 seconds, which is quite good.

One important speedup I got was to change the identifier vs. reserved
word lookup so that it does not use a linear search over the list of
reserved word strings. That cut overall parsing time down from 10
seconds to 8.
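The email does not say which data structure replaced the linear search;
one natural choice in Haskell is a Data.Map keyed on the keyword string,
which gives O(log n) lookup per lexed name instead of O(n). A minimal
sketch (the token names here are illustrative, not the actual c2hs
token type):

```haskell
import qualified Data.Map as Map

-- Hypothetical token type; the real c2hs lexer's tokens differ.
data Token = KwInt | KwReturn | KwTypedef | Ident String
  deriving (Eq, Show)

-- Build the keyword table once; each lookup is then O(log n)
-- rather than a linear scan over the list of reserved words.
keywords :: Map.Map String Token
keywords = Map.fromList
  [ ("int", KwInt), ("return", KwReturn), ("typedef", KwTypedef) ]

-- Classify a lexed name as a reserved word or a plain identifier.
classify :: String -> Token
classify s = Map.findWithDefault (Ident s) s keywords
```

Since the keyword set is fixed at compile time, a perfect hash or a
trie baked into the lexer would be faster still, but a shared Map is
the simplest change that removes the linear scan.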

I've slightly generalised the grammar for GNU C __attribute__s after
looking at the GCC manual.

So the remaining things to be done before posting the code for review
are to get the attribute allocation going and to check that the AST
produced by this parser is exactly the same as the one produced by the
existing parser. Then I'll integrate this parser into c2hs and see what
the overall memory requirements turn out to be, taking into account the
name analysis phase.

> > > Moreover, the module c2hs/base/syntax/ParserMonad.hs already provides
> > > a monad suitable for Happy.
> > 
> > I'll take a look.

Unfortunately, since the monad used by the parser has to be the same as
the one used by the lexer, the existing parser monad is not suitable.
I've based another one on the Happy monadic-lexer example and the GHC
lexer/parser monad.
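A combined lexer/parser monad in that style is essentially a state
monad over the remaining input plus whatever the lexer and parser need
to share, here the set of typedef'ed names. The following is a sketch
under those assumptions; the names are illustrative and do not match
the actual c2hs ParserMonad:

```haskell
import qualified Data.Set as Set

-- Shared lexer/parser state: remaining input plus the names the
-- parser has seen introduced by typedef so far.
data PState = PState
  { psInput    :: String
  , psTypedefs :: Set.Set String
  }

-- A state monad with failure, in the style of the Happy monadic
-- example and GHC's own P monad.
newtype P a = P { runP :: PState -> Either String (a, PState) }

instance Functor P where
  fmap f (P g) = P $ \s -> fmap (\(a, s') -> (f a, s')) (g s)

instance Applicative P where
  pure a = P $ \s -> Right (a, s)
  P f <*> P g = P $ \s -> do
    (h, s')  <- f s
    (a, s'') <- g s'
    Right (h a, s'')

instance Monad P where
  P g >>= k = P $ \s -> do
    (a, s') <- g s
    runP (k a) s'

-- Operation exposed to the parser's semantic actions: record a
-- newly typedef'ed name.
addTypedef :: String -> P ()
addTypedef name = P $ \s ->
  Right ((), s { psTypedefs = Set.insert name (psTypedefs s) })

-- Operation used by the lexer: decide whether a lexed name should
-- become a type-name token or an ordinary identifier token.
isTypedef :: String -> P Bool
isTypedef name = P $ \s ->
  Right (Set.member name (psTypedefs s), s)
```

Because the lexer and the parser's semantic actions run in the same
monad, the typedef set updated by the parser is immediately visible to
the lexer when it classifies the next token.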

> > As for the lexer/parser interaction required for C, I guess the way to
> > do this is to make the lexer monad keep the set of 'typedefed'
> > identifiers and return the appropriate token type to the parser
> > depending on membership of that set. The lexer monad should also expose
> > an operation to the parser to add identifiers to the 'typedefed' set
> > which the parser would use after parsing the appropriate kind of
> > declaration.
> 
> Sounds right to me.

I got this scheme working after encountering one interesting problem
along the way...

I had to refactor the C grammar slightly to move the reduction rule for
typedefs (which adds the typedef'ed names to the lexer/parser monad
state) below the ';' terminal in declarations so that constructs like
the following work:

typedef int foo;
foo bar (int baz);

Otherwise the 'foo' token on the second line is not recognised as a type
name, because the reduction rule for the previous declaration is not
performed until the 'foo' token has already been lexed (the parser needs
one token of lookahead). With the slightly refactored grammar/semantic
actions, the reduction rule is invoked as soon as the ';' token is
encountered, and so it all works. (With thanks to Heffalump and skew on
#haskell for help on this issue.)
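The refactoring can be sketched in Happy notation as splitting the
declaration rule in two (the rule and action names here are
illustrative, not the actual c2hs grammar):

```
-- Before: the typedef names are only recorded when the whole
-- declaration reduces, which happens after the token *following*
-- the ';' has already been lexed as lookahead:
--
--   declaration
--     : declaration_specifiers init_declarator_list ';'
--         {% registerTypedefs $1 $2 }
--
-- After: the inner nonterminal reduces as soon as ';' becomes the
-- lookahead token, so the typedef names are in the monad state
-- before the lexer touches the token after the ';':
--
--   declaration
--     : declaration_body ';'          { $1 }
--
--   declaration_body
--     : declaration_specifiers init_declarator_list
--         {% registerTypedefs $1 $2 }
```

In the "typedef int foo;" example, this means 'foo' is already in the
typedef set by the time the lexer classifies the 'foo' that starts the
next declaration.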

Duncan


