[Haskell-cafe] I want to write a compiler

Austin Seipp mad.one at gmail.com
Sat Mar 7 20:45:06 EST 2009


Hi,

(Please note this is coming from my own experience working with the LHC Haskell
compiler, as well as a compiler I'm currently working on in SML. I'm
not an authority, but as another greenhorn compiler hacker I thought I
might give some advice.)

Excerpts from Loup Vaillant's message of Sat Mar 07 17:32:48 -0600 2009:
> Ideally, I would very much like to compile to C.
> 
> The requirements are easily stated. My language must
> -> be lazy by default,
> -> support algebraic data types and case expressions (unless I can get
> away with encoding them as functions),
> -> have some kind of FFI with C (I suppose it implies some support for
> unboxed values).
> 
> There is also the first class functions thing, but lambda lifting is okay.

Unfortunately, after working on LHC, I can confirm (like all
the researchers who wrote the papers which I *didn't* read
beforehand) that for a lot of purposes, C is an unsuitable fit for a
compiler's target language. It works pretty well if you can make the
source language's execution model fit well enough, but if you can't,
you'll pay a price (in sanity, performance, or both.)

(On that note, I am currently of the opinion that most of LHC's major
deficiencies, aside from a few parser bugs and some needed
optimizations, come from the fact that compiling to C is currently
our only option; because of it, we have no exception handling or
proper garbage collection at all. The runtime system is also a
little needlessly 'clever' (if small and understandable) so that it
can cope with that.)

However, since your goal is *not* efficiency, you will be happy to
know that the issues with compiling to C (like garbage collection and
exception handling) can be worked around viably.

For garbage collection, please see:

"Accurate Garbage Collection in an Uncooperative Environment" -
http://citeseer.ist.psu.edu/538613.html

This strategy is currently used in Mercury as well as Ben L.'s DDC
language; on that note, I think if you spend some time looking through
DDC's runtime and generated code, you can see exactly what the paper
is talking about, because it's actually a very simple strategy for
holding onto GC roots:

http://code.haskell.org/ddc/ddc-head/runtime/
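
To give a flavor of Henderson's technique: each generated C function
pushes a struct recording the addresses of its pointer-holding locals
onto a linked "shadow stack" that the collector can walk (and pops it
on the way out). A minimal sketch, assuming a hypothetical gc_alloc
allocator; none of these names are DDC's:

    #include <stddef.h>

    /* stand-in for the collector's allocator (hypothetical) */
    extern void *gc_alloc(size_t bytes);

    /* One of these lives on the C stack per function with GC roots;
       chained together they form the "shadow stack". */
    struct frame {
        struct frame *prev;  /* caller's frame */
        int nroots;          /* how many roots this frame holds */
        void **roots[2];     /* addresses of local pointer variables */
    };

    struct frame *gc_top = NULL;  /* top of the shadow stack */

    void *cons(void *head, void *tail) {
        /* push a frame recording where our live locals are */
        struct frame f = { gc_top, 2, { &head, &tail } };
        gc_top = &f;

        /* allocation may trigger a collection; the collector finds
           (and may move) 'head' and 'tail' by walking gc_top */
        void **cell = gc_alloc(2 * sizeof(void *));
        cell[0] = head;
        cell[1] = tail;

        gc_top = f.prev;  /* pop our frame on exit */
        return cell;
    }

The price, as the paper discusses, is that every function entry/exit
does this bookkeeping, and locals must be spilled to memory so their
addresses can be taken; but it's portable and very easy to generate.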

However, it's for reasons like this (C is really hard to compile to
effectively) that Simon & co. have spent so much time on the C--
project. C-- is one of the backend languages used in GHC, and it is
designed to be a 'uniform assembly language' for high level languages
to compile to.

You can find a lot of info on C-- here; I recommend the paper at the
bottom to start off:

http://research.microsoft.com/en-us/um/people/simonpj/papers/c--/

These papers should further illustrate the issues with compiling to C;
for a pedagogical excursion, these issues can for the most part be
worked around simply, as Henderson's paper illustrates, but they're
worth keeping in mind, too.

Hopefully those papers will help you decide on your compilation
model and the route you would like to take. I can't say much about
laziness, other than: read Simon Peyton Jones's full book (it's
an amazing read):

http://research.microsoft.com/en-us/um/people/simonpj/papers/slpj-book-1987/

That should help you concerning laziness, compilation, etc.
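
I can't resist one tiny illustration of the core idea, though: a
thunk is just a heap node carrying a code pointer, and forcing it
runs that code and then overwrites ("updates") the node with the
result, so it is evaluated at most once. A toy sketch in C (all names
invented here; a real STG runtime is considerably more involved):

    typedef struct node node;
    struct node {
        node *(*entry)(node *self);  /* code that evaluates this node */
        long  value;                 /* payload once evaluated */
    };

    /* entry code shared by every evaluated node: nothing left to do */
    static node *entry_done(node *self) { return self; }

    /* force: run the node's code, then overwrite the thunk with the
       result, so each thunk is evaluated at most once */
    static node *force(node *t) {
        node *v = t->entry(t);
        *t = *v;  /* the update */
        return t;
    }

    /* a thunk for the expression (40 + 2), computed only on demand */
    static node *entry_forty_two(node *self) {
        static node result = { entry_done, 0 };
        result.value = 40 + 2;
        return &result;
    }

    int main(void) {
        node t = { entry_forty_two, 0 };  /* suspended, not yet 42 */
        force(&t);                        /* now t.value == 42 */
        return (int)t.value - 42;         /* exits with status 0 */
    }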

As for the FFI, I don't really have any papers on 'how to implement an
FFI', but Matthias Blume's paper might be of interest:

"No-Longer-Foreign: Teaching an ML compiler to speak c "natively"" 

http://ttic.uchicago.edu/~blume/papers/nlffi-entcs.pdf

libffi may also be worth investigating:

http://sourceware.org/libffi/
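
For reference, the usual libffi pattern is to describe a signature
with ffi_prep_cif and then make the call with ffi_call; something
like this sketch, calling puts from libc (link with -lffi):

    #include <stdio.h>
    #include <ffi.h>

    int main(void) {
        ffi_cif cif;
        ffi_type *arg_types[1] = { &ffi_type_pointer };
        void *arg_values[1];
        char *s = "hello from libffi";
        ffi_arg result;

        /* describe the signature of int puts(const char *) */
        if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1,
                         &ffi_type_sint, arg_types) == FFI_OK) {
            arg_values[0] = &s;  /* libffi takes the address of each arg */
            ffi_call(&cif, FFI_FN(puts), &result, arg_values);
        }
        return 0;
    }

Since the call description is built at runtime, this is handy when
your compiler doesn't know every foreign signature at code-generation
time.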


> I have chosen the STG machine because I thought it would be easier to
> get an FFI and case exprs out of it. Maybe I am wrong, and template
> instantiation is really easier. There is also the BRISK machine, and
> GRIN. But the papers I read were less readable to me than the three
> mentioned.

How far into GRIN have you looked? It is one of the intermediate
languages for LHC/JHC, and for a low-level intermediate representation
it works quite well, I think. It is a strict, monadic intermediate
language; being strict, we must still be able to model closures
('suspended applications', or thunks), so we model closures/partial
applications in a sort of 'defunctionalized' way (making GRIN a sort
of 'defunctionalized intermediate language', really) by turning them
into something along the lines of an algebraic data type. A
good benefit of this is that you actually /don't/ have to write
eval/apply yourself, because the compiler is supposed to generate it
*for* you.

In fact, this is a core part of GRIN - the generation of eval by the
compiler is critical for many important optimizations to take
place. It also makes the runtime a bit simpler.
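
To make that concrete: a suspended call becomes an ordinary tagged
node naming the function, and the generated eval is just a dispatch
over every such tag in the program. A rough C rendering, borrowing
the 'upto' example from Boquist's thesis (the C encoding is mine):

    /* every heap object is a node: a tag plus its arguments */
    enum tag {
        CINT,     /* constructor: an evaluated machine integer  */
        FUPTO,    /* a suspended call to 'upto' (an F-node/thunk) */
        PUPTO_1   /* 'upto' partially applied, one argument missing */
    };

    struct gnode {
        enum tag tag;
        struct gnode *args[2];
    };

    /* compiled code for 'upto' lives elsewhere */
    struct gnode *upto(struct gnode *m, struct gnode *n);

    /* 'eval' is never written by hand: the GRIN compiler emits one
       case per F-tag occurring anywhere in the (whole) program */
    struct gnode *eval(struct gnode *node) {
        switch (node->tag) {
        case FUPTO: {                    /* suspended: run it... */
            struct gnode *v = upto(node->args[0], node->args[1]);
            *node = *v;                  /* ...and update in place */
            return node;
        }
        default:                         /* C- and P-tags are values */
            return node;
        }
    }

Because eval is ordinary first-order code, the optimizer can inline
and specialize it at each call site, which is where many of GRIN's
wins come from.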

This comes at the price that GRIN is by design a whole-program
intermediate form; in order to pull off any of these sophisticated
transformations, everything must be in memory at once. But as we have
seen with LHC/JHC, it can make a *huge* difference (apps that are 10x
smaller *and* 10x faster than ones created by GHC).

Having dealt with GRIN as I work on LHC (when time permits...), I
can say that it's a reasonable strategy and in practice it turns out
pretty well; but if you're not into optimizing the code, then STG
might be a better fit.

I can't really give you that much advice on this area so much as
you'll just have to choose. But GRIN does have at least two excellent
references:

 "Code Optimisation Techniques for Lazy Functional Languages" -
 U. Boquist, http://www.cs.chalmers.se/~boquist/phd/index.html

 "The GRIN Project: A Highly Optimising Back End For Lazy Functional
 Languages" - U. Boquist, http://www.cs.chalmers.se/~boquist/ifl96-abstract.html

Then again you said you're not interested in optimization, so perhaps
I'm just preaching to the totally wrong choir... :) 

Austin

