GHC core representation
Alastair David Reid
reid@cs.utah.edu
04 Jun 2001 22:23:10 -0600
[This is in reply to Andrew Tolmach's mail to the Haskell list. He
asked that replies should be sent to ghc-users.]
> Many months after the topic was first raised, there is finally a draft
> document describing a formal external syntax for GHC's "Core"
> intermediate language.
Wow! This is very cool - it answers so many questions that I
previously had to examine the source code to figure out!
Two meta-comments before I go into detailed comments.
1) GHC documentation tends to have a very short half-life.
What can be done to minimize this problem?
For example can the tables of primops in section 4 be automatically
generated?
2) I haven't seriously hacked on GHC or the the GHC frontend for
several years so some of the following comments will be a little
off. I'm hoping that people who are more up to date will correct
me and that their corrections will be useful to anyone like me
who used to know GHC but have lost track of the details.
OK, onto the detailed comments.
1) GHC identifiers have SrcLocs (filename/linenumber) attached to them.
It might be useful to be able to pass those into your parser.
(But what are the SrcLoc's used for in Core code? Would they be useful?)
2) The names you use for operations all use the somewhat unreadable
Z encoding. I think you should Z-decode these names because:
1) They're a lot more readable. e.g., decode "ZLZmZgZR" = "(->)"
2) It's possible that GHC will get confused if you fail to Z-encode
a name. For example, it isn't possible to have the name "Z"
in Core because GHC encodes the Haskell identifier "Z" as "ZZ".
From memory, the Z encoding is something like this:
ZL -> (
ZR -> )
Zm -> -
Zg -> >
Zl -> <
Ze -> =
Zc -> :
Zp -> .
Zxxx -> chr(xxx) where xxx is a sequence of digits
3) The restriction that Main.main have type IO a is unfortunate and, I
think, unnecessary. It shouldn't be that hard to change it so that
its type is more like:
State# RealWorld -> (# State# RealWorld, Void# #)
4) While explaining namespaces, it'd be convenient to point out that
you put a % in front of all keywords.
Hmmm, I guess this is one reason to use Z-encoded names: % is a
legal Haskell identifier.
5) The @a in datatype declarations threw me for a while.
Later I saw you using it like BigLambda in an explicitly typed
lambda calculus and understood what you were doing. It'd be
worth making the connection explicit.
6) An alternative way of defining data constructors would be like this:
%data BinTree :: * -> * = {
Fork :: %forall a . BinTree a -> BinTree a -> BinTree a;
Leaf :: %forall a . a -> BinTree a
}
That is, specify the actual type of data constructors instead of
using Haskell datatype declaration syntax. This makes the language
a little easier to specify.
Notice that this syntax is a little more liberal than standard Haskell
syntax because you could write types like:
%data Expr :: * -> * = {
Int :: Int -> Expr Int;
App :: %forall a, b . Expr (a -> b) -> Expr a -> Expr b;
Lam :: %forall a, b . Var a -> Expr b -> Expr (a -> b)
}
which cannot be expressed using normal Haskell syntax because
the result types of each constructor are different. I saw
Lennart mention this idea in a talk about 8 years ago and I've
always wanted to play with it. :-)
On the other hand, you might avoid this generalisation in case
GHC does something weird with it.
7) In section 3.6, you say:
"Value applications may be of user-defined functions, data constructors
or primitives. Application of the latter two sorts need _not_ be
saturated."
I think you mean "none of these applications need to be saturated
although both previously published descriptions of Core required
that the latter two be saturated."
8) You say that the list of case alternatives "need not be exhaustive,
even if no default is given; it is a disastrous run-time error if a
needed case arm is missing."
1) Just how disastrous? Is an exception raised or does the RTS crash?
2) I feel a little uneasy with this design decision.
9) You don't mention the _scc_ operation used for profiling.
10) There's a move on to define %ccall in a more generic way
that would apply to calling Java functions, .net functions, etc.
This is bound to result in a change of the "ccall" name.
It'd be worth adding a Working note to that effect.
11) The discussion of the terms "unboxed", "heap allocated" and
"unlifted" (page 9) doesn't seem quite right. I believe it is:
Lifted types must be heap allocated. Unlifted types may be heap
allocated (e.g., Array#s) or unboxed (not heap allocated) (e.g., Int#).
12) The discussion of operations on MVar# (page 10) says that they
take an initial type argument of the form (State# t) for some
thread "t". Is this true? Does "t" not have to be RealWorld#?
What does it mean if it is not RealWorld#?
13) Are the CCallable and CReturnable classes still there? Blech!
[I'm the one who implemented them - I'd hoped they'd gone by now.]
14) If strings are represented by the "address of a C-format string"
(section 4.2), how do we represent strings with embedded \0 characters
in them?
15) dataToTag#, tagToEnum# and getTag# (section 4.4.1) might be used
to implement the to/fromEnum operations but they may also be
remnants of GHCi version 0.0 - a metacircular interpreter that
was in GHC version 0.18 (or thereabouts). If the latter, someone
ought to give them a decent burial.
16) You document unsafeCoerce# but not
reallyUnsafePtrEq# :: a -> a -> Bool
did someone finally kill that?
17) decodeDouble# and decodeFloat# are used to extract the exponent
and mantissa of a floating point number. Ignoring unboxity,
it returns an Int (the exponent) and an Integer (the mantissa).
Representing that as unboxed types, you get (# Int#, Int#, ByteArray# #)
18) Section 4.4.8 asks what the relationships are between quotient,
remainder, div and mod. This ought to be the same as that specified
in the Haskell report for their boxed equivalents.
19) Your question about what indexArray# does and what alignment
restrictions apply reminds me of a subtlety:
When indexing into a ByteArr# or a Addr#, is the index
scaled by the size of the object being read/written (as in C)
or not?
I am pretty sure that the index on an Addr# is not scaled (and the
value of the (Addr# + Int#) combination is subject to exactly the
same alignment restrictions as imposed by the hardware architecture.
I don't recall the story for ByteArr# but a ByteArr# object is
always aligned to the nearest 4 (maybe 8) byte boundary.
20) Section 4.4.15 says that takeMVar# and putMVar# "die" if the MVar is
empty (respectively, full).
1) What does "die" mean?
More generally, how is failure implemented in the primops?
Do they raise exceptions? Do they return error codes?
Do they call exit? Do they crash the RTS?
2) Are you sure these ops "die"? Their unboxed Haskell brethren
will block the thread under the same circumstances.
21) You repeatedly mention the sequential implementation of the RTS.
Last time I looked, it wasn't possible to build the RTS with
concurrency turned off.
If one were to build the GHC RTS with concurrency turned off, I
think the concurrency primitives should not be available rather
than providing versions that always/mostly fail. Or maybe one could
implement non-preemptive "concurrency" in much the same way as we
did in Hugs? (Not sure I recommend this route - the interactions
between threads and exception handlers in Hugs are somewhat tricky.)
--
Alastair Reid reid@cs.utah.edu http://www.cs.utah.edu/~reid/