[C2hs] a new C parser

Sun Feb 18 20:43:42 EST 2007

All,

I posted some time ago about a plan to make a new C parser that would
correctly deal with the typedef problem and then to try some automated
testing by making a gcc wrapper. I've now mostly done that.

Currently I can parse almost all of the Linux kernel, glibc, GNU
binutils, and the C code produced by ghc when it compiles itself.

I can also parse most of the artificial nasty typedef examples:

typedef int A, B(A);

I think I correctly deal with scope of typedef names and typedef names
being shadowed by local variable declarations:

typedef int A;
A main () {
  A A;        -- declares local var A as having type A
  A = A + 1;  -- A now refers to the local var, not the type;
}
A x;  -- end of scope, A reverts to referring to the typedef name.

A slightly more tricky example of the same thing is where function
parameters redeclare names:

typedef int A;
A main (A A) {
  A = A + 1;  -- A now refers to the local var, not the type;
}

An example I don't deal with yet is:

typedef int A;
int A = sizeof(A);

that is we shadow the typename A with the local var A, so the A in the
initialiser should now refer to the local var not the typename. However
currently the parser only does the shadowing after the ';', not after
the '=' as would be required. This is because the production rule looks
like:

declaring_list
  : declaration_specifier declarator initializer_opt
      {% do doDeclIdent $1 $2 
            return $ CDecl $1 [($2, $3, Nothing)]) }

where as we'd really like to say:

declaring_list
  : declaration_specifier declarator {% doDeclIdent $1 $2 } initializer_opt
      { CDecl $1 [($2, $4, Nothing)]) }

that is to modify the typedef set immediately after the declarator and
before parsing the initializer. This is exactly how people do it with
YACC but unfortunately happy does not support this form where actions
are interspersed in the production. It's probably possible to refactor
the grammar to do this with happy, though it's not totally obvious to me
how.

Dealing with '__attribute__'s:
==============================

GNU C attributes are a real pain. They seem to be able to appear almost
anywhere in a declaration. However adding them to the grammar has to be
done very carefully to avoid introducing ambiguities (ie shift/reduce or
reduce/reduce conflicts).

So for the moment I've taken the approach of not parsing them at all but
recognising and ignoring attributes in the lexer and not passing them on
to the parser at all. This effectively treats attributes as whitespace
and they can appear anywhere.

This isn't a long term solution because eventually we'll need to
recognise some attributes because they can affect structure layout and
we need that for accurately calculating the size of types and offsets of
structure members.

C99 extensions:
===============

I'm not sure I've got 100% coverage of C99 features, but at least I
cover the ones used in the code I've seen so far. These are just the
ones that I can remember adding:

      * mixed declarations and statements in a compound statement
      * compound literals
      * for statements with declarations: for (int n = 0; n < m; n++)
        {};
      * wide character character and string literals
      * unsigned long long numeric literals
      * _Bool and _Complex basic types

GNU extensions:
===============

Again, just some of the GNU extensions that I can remember adding to the
new parser compared to the extensions the previous parser recognised:

      * anonymous structs and unions inside other structs/unions.
      * thread local storage qualifier
      * old style and array range compound literal member designator
      * case ranges
      * computed gotos and address of labels
      * conditional operator with missing then part: "x ? : y"
      * empty structs
      * allow redundant ';' in several places
      * asm expressions
      * several built-in 'functions' that take type names as parameters

Various of these extensions require extensions in the AST.

Testing:
========

I've made a cc-wrapper program that you can use as if it were gcc. It
calls gcc with the same args as it was called with. Then if it can
figure out what the gcc args meant and it looks like gcc was being asked
to compile a .c file then it tries to pre-process and parse the same .c
file. It outputs results to a log file and puts detailed reports on
parse failures into separate files (in a directory specified via an
environment variable). So (in theory at least) one ought to be able to
compile vast amounts of C code and find out which extensions are really
used and break the parser.

At the moment the it doesn't check that it correctly parses the code,
only that it parses it at all. Extra checks as I mentioned before could
include checking if the sizes of types and offsets of structure members
computed by c2hs match those computed by gcc. Another test would be to
pretty-print the C code again and see if gcc can still parse it and if
it produces the same object code (that'd require fully parsing
attributes).

As I said, I've been testing this on the linux kernel (using
allyesconfig) glibc and ghc. Last time I tried it against linux it
failed on 7 out of 4406 .c files. All those failures were due to the
same bug which I think I've now fixed. In glibc, a few files fail to
parse due to their use of nested function declarations which is a rarely
used GNU extension that I've not implemented yet.

Plan:
=====

So there are a few things to do, one is to continue testing large
amounts of C code. I intend to include cc-wrapper in c2hs (or perhaps do
it as a c2hs mode via a flag) so that users can do their own testing and
provide a way to give detailed bug reports.

I haven't checked yet that the new C parser doesn't break things when
doing the full .chs -> .hs translations. Also, I've not extended the
guts of c2hs to understand any of these new language extensions so
trying to bind things which really use them may not work.

Then I need to start merging the new parser into the darcs version of
c2hs. I'll try and do this as a series of small understandable patches.

If anyone wants to play with the code I've got right now rather than
wait for things to get merged then do say. I could put up a tarball.

Duncan