[C2hs] anyone interested in developing a Language.C library?

Fri Apr 20 03:12:44 EDT 2007

Hi all,

If anyone is interested in developing a Language.C library, I've just
completed a full C parser which we're using in c2hs.

It covers all of C99 and all of the GNU C extensions that I have found
used in practice, including the __attribute__ annotations. It can
successfully parse the whole Linux kernel and all of the C files in all
the system packages on my Gentoo installation.

It's implemented as an alex lexer and a happy parser. The happy grammar
has one shift/reduce conflict for the dangling if/then/else issue (which
could be hidden by using precedence but it's clearer not to).
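
For anyone who hasn't met the dangling else issue, here is a small C
illustration. The function name `classify` is made up for this example
and is not from c2hs. C attaches an `else` to the nearest preceding `if`
that lacks one, and happy's default of resolving a shift/reduce conflict
by shifting gives the same binding, which is why the one reported
conflict is harmless.

```c
/* Returns 1, 2 or 0 depending on which branch runs.
   The indentation is deliberately misleading. */
int classify(int a, int b)
{
    int result = 0;
    if (a)
        if (b)
            result = 1;
    else              /* looks as if it pairs with `if (a)` ... */
        result = 2;   /* ... but C pairs it with `if (b)` */
    return result;
}
```

So `classify(1, 0)` yields 2, while `classify(0, 0)` falls through both
tests and yields 0. Expect gcc to warn about the ambiguous `else` when
warnings are enabled.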

So if someone is interested in developing a more widely usable
Language.C library, I think this would be a good place to start. There's
plenty to do however:
      * The c2hs C AST is ok but probably not enough for a general
        purpose library.
      * The parser currently uses some other c2hs infrastructure which
        would need disentangling to pull the parser out (mostly
        identifiers and unique name supply management).
      * It does not record everything into the parse tree, e.g.
        __attribute__s are parsed but ignored.
      * It does no semantic analysis after parsing (though other bits of
        c2hs do a very little).
      * In at least one place the parser is deliberately too liberal (to
        avoid ambiguities) which would require simple extra checks after
        parsing to detect.
      * The lexical syntax has not been checked against the spec fully;
        it is probably over-liberal in some cases.
      * I've not done much performance work. The lexer has not been
        seriously tuned and it still lexes via a String. Having said
        that, the performance is not at all bad: on a 3GHz box it does
        ~20k lines/sec.
      * The parser error messages are terrible (it might be interesting
        to try porting from happy to frown for this purpose)

There's probably more stuff, but that's what I can think of right now.

So if anyone is interested then let me know, I can give some pointers
(hopefully the useful kind, not the void * kind).

You can get the code from the c2hs darcs repo:
darcs get --partial http://darcs.haskell.org/c2hs/
The C parser bits are under c2hs/c/

Duncan

Licensing:
It's not 100% clear. At the moment it's marked as GPL, but it's derived
from several sources so we need to be careful about that. Personally I'm
happy to use LGPL. It derives from c2hs obviously, which is GPL, though
we could enquire about re-licensing, especially since there is very
little of c2hs stuff used in it any more. It also derives partly from
James A. Roskind's C grammar (in particular the grammar of
declarations). His copyright license is fairly liberal but this needs
double-checking. It also derives from the C99 spec and I read the
comments in the gcc C parser as a guide to GNU C's extensions to the C
grammar (no code or comments were copied however).

Testing:
So far I've tested it by writing a little gcc wrapper script. You can
build any ordinary bit of C software using this wrapper: it calls gcc
with the same args but also tries to parse the input file, and reports
the result into a log file. I've not tried the gcc C parser testsuite.
This approach would probably work for other tests too, such as checking
that parsing and pretty printing round-trip correctly. The token streams
won't be identical (since parsing drops redundant brackets etc), but we
could check that gcc produces identical .S/.o files. Something that c2hs
needs is to calculate sizes of types and structure member offsets
correctly. That could also be tested in this style, by comparing against
what gcc thinks on thousands of example .c files.
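
As a rough sketch of what "what gcc thinks" could look like for the
size and offset check, the fragment below asks the compiler directly via
`sizeof` and `offsetof`. The struct and function names are hypothetical
and not taken from c2hs. The packed variant is included on the
assumption that layout-affecting attributes would have to be recorded
before computed sizes could agree with gcc.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical sample types, used only for illustration. */
struct plain {
    char tag;
    int  value;   /* normally preceded by padding for alignment */
};

struct packed {
    char tag;
    int  value;   /* placed immediately after `tag` */
} __attribute__((packed));   /* GNU C extension: no padding */

/* The compiler's own answers, to compare a tool's results against. */
size_t plain_value_offset(void)  { return offsetof(struct plain, value); }
size_t packed_value_offset(void) { return offsetof(struct packed, value); }
size_t plain_size(void)          { return sizeof(struct plain); }
size_t packed_size(void)         { return sizeof(struct packed); }
```

A test harness in the wrapper style could print these values for each
struct it finds and diff them against the sizes and offsets computed
from the parse tree.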