[Haskell-cafe] Hierarchical abstraction
Andrew Coppin
andrewcoppin at btinternet.com
Thu Jan 28 14:42:55 EST 2010
OK, this is going to be a long one...
Let's start with parsers. A monadic parser library (e.g., Parsec) gives
you a set of primitive parsers. It also supplies functions to construct
more complex commonly-used parsers. And then it provides functions for
constructing even more complex parsers. (E.g., Parsec's expression
parser builder.)
Let's call the most basic primitive parsers level-0. Then we have
level-1 functions, which construct parsers from level-0 parsers. And
later we have level-2 functions which call the level-1 functions to
construct even more sophisticated parsers. And so on and so forth, to
unlimited depth. (The parser library itself has a limited depth, but the
library can have clients that build functions on top of functions.)
And so, in the end, the user calls a level-64 function, and gets a
parser that does the required task. And that's great.
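To make the levels concrete, here's a tiny sketch using Parsec (number and pair are just names I've made up for the example):

import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Applicative ((<$>), (<*>), (*>), (<*))

-- Level-0: the primitives the library gives you (digit, char, ...).

-- Level-1: a parser built directly from the primitives.
number :: Parser Int
number = read <$> many1 digit

-- Level-2: a parser built from the level-1 parser.
pair :: Parser (Int, Int)
pair = (,) <$> (char '(' *> number <* char ',')
           <*> (number <* char ')')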
The only problem is, the parser you get is a black box. You can't see
what's inside it or what it's going to do. You can only run it. (Or
combine it with other parsers.)
Why is that a problem? Well, suppose you want to pass the finished
parser through some kind of optimiser to rearrange its internal
structure. Ah. You can't. It's a black box.
Veering off onto a tangent for a moment... It seems to me that this is
an inevitable consequence for any monadic parser. After all, the bind
method runs some parse primitive, but it then feeds the result to a
general function, which can undertake some arbitrarily complex
Turing-complete calculation to decide what parse primitive to run next.
Hell, I can write a parser that parses a description of a grammar,
dynamically constructs a parser for that grammar, and then runs it!
There's no way in hell that some static optimiser can hope to optimise
the meta-parser so that it always generates optimal parsers. Even in
theory it's utterly impossible.
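A contrived little example of the problem (counted is another made-up name): the shape of the rest of the parse depends on a value that only exists once part of the input has already been consumed.

import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Applicative ((<$>), (<*))
import Control.Monad (replicateM)

-- The count n is only known at run time, after the first parse has
-- succeeded, so no static analysis can see the shape of what follows.
counted :: Parser [String]
counted = do
  n <- read <$> many1 digit
  _ <- char ':'
  replicateM n (many1 letter <* char ';')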
I wonder if you can make a parser combinator library which is *not*
monadic? And, if you could, in what way would that limit the kinds of
things you can parse?
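One possibility, purely as a toy sketch (Static, describe and so on are names I've invented, not a real library): stop at Applicative, with no bind, and have each parser carry an inspectable description alongside the function that does the work.

-- A non-monadic parser: a static description plus the parsing function.
data Static a = Static
  { describe  :: [String]                     -- structure visible without running it
  , runStatic :: String -> Maybe (a, String)  -- the parser itself
  }

instance Functor Static where
  fmap f (Static d p) = Static d (\s -> fmap (\(a, rest) -> (f a, rest)) (p s))

instance Applicative Static where
  pure x = Static [] (\s -> Just (x, s))
  Static d1 p1 <*> Static d2 p2 = Static (d1 ++ d2) $ \s -> do
    (f, s')  <- p1 s      -- this do-block is in the Maybe monad, not a parser monad
    (a, s'') <- p2 s'
    Just (f a, s'')

-- Level-0 primitive: match one exact character.
charS :: Char -> Static Char
charS c = Static ["char " ++ show c] $ \s -> case s of
  (x:xs) | x == c -> Just (x, xs)
  _               -> Nothing

Because there's no bind, the whole structure is sitting there in describe before anything runs. The price is exactly the thing above: later parts of the parse can no longer depend on earlier results, so the length-prefixed example is out of reach.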
Anyway, coming back on topic... Let's suppose that instead of parsers,
we're interested in queries. Suppose I'm writing some kind of relational
database engine library, and I want to optimise queries before running them.
The obvious solution is to design some kind of algebraic data type which
can represent every possible level-0 query primitive. I can then write
level-1 functions that construct level-0 data structures, and level-2
functions that call the level-1 functions, and so on, and at the end the
caller will get a (presumably complex) level-0 description, which my
optimiser can then go away and chew on.
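Something along these lines (every name here is invented for the sake of the example):

-- Level-0: a plain algebraic data type describing queries.
data Query
  = Table String
  | Select Pred Query          -- keep rows satisfying the predicate
  | Project [String] Query     -- keep only the named columns
  | Join Query Query
  deriving Show

data Pred
  = ColEq String String        -- column = literal
  | And Pred Pred
  deriving Show

-- Level-1: a helper that merely builds level-0 structure.
lookupBy :: String -> String -> Query -> Query
lookupBy col val = Select (ColEq col val)

-- Level-2: built on level-1; the caller still ends up with plain
-- level-0 data for the optimiser to chew on.
findUser :: String -> Query
findUser name = Project ["id", "email"] (lookupBy "name" name (Table "users"))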
Now suppose I want to perform optimisations on the higher-level
structure of the query. Currently, I can't. The hierarchy of abstractions
gets flattened to level-0 as the chain of calls unwinds, so my
optimiser can only see the final level-0 representation.
The obvious solution would be to design an algebraic data type for every
level of the hierarchy you want to analyse, complete with a function for
converting to the next level down. As I understand it, this is how GHC
works. It starts with raw Haskell source code, which it mashes around,
and eventually converts into Core. The Core representation is then
fiddled around with further, and eventually turned into STG. The STG
gets poked and prodded, and converted to C--. And that eventually
becomes raw assembly.
The thing is, these are all radically different representations of the
program. You can't mix (say) Haskell and STG together. And nor would you
want to, given that they're such wildly different abstractions.
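In miniature, the pattern looks something like this (a deliberately tiny, made-up pair of representations with a lowering pass between them; real Core is far richer, and keeps let, for instance):

-- The surface language, with syntactic sugar.
data Surface
  = SVar String
  | SLam String Surface
  | SApp Surface Surface
  | SLet String Surface Surface     -- let x = e1 in e2

-- The lower-level language: no let, it gets compiled away.
data Core
  = CVar String
  | CLam String Core
  | CApp Core Core

lower :: Surface -> Core
lower (SVar x)     = CVar x
lower (SLam x b)   = CLam x (lower b)
lower (SApp f a)   = CApp (lower f) (lower a)
lower (SLet x e b) = CApp (CLam x (lower b)) (lower e)

Because Surface and Core are distinct types, you simply can't write down a term that mixes the two.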
[I notice in passing that, apparently, you can now FFI into C--, which
is interesting...]
However, consider the initial Haskell input. Initially it reflects what
the programmer typed in. But GHC gradually desugars this into some
minimal subset of the full Haskell language, before actually converting
it to Core. That means that [picking an example out of thin air]
initially an expression might contain the if/then/else construct, but at
some point [I presume] this is desugared into a case expression. And
that must mean that there are parts of the GHC codebase where it is
permissible for an expression to contain if/then/else, and other parts
where this is not permissible (and the code expects this construct to
not be present).
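In other words, somewhere there must be a pass of roughly this shape (another made-up miniature):

data Expr
  = Var String
  | If Expr Expr Expr
  | Case Expr [(String, Expr)]
  | App Expr Expr

-- After this pass no If node remains -- but nothing in the type Expr
-- records that fact, so later passes just have to trust it.
desugarIf :: Expr -> Expr
desugarIf (Var x)       = Var x
desugarIf (App f a)     = App (desugarIf f) (desugarIf a)
desugarIf (Case s alts) = Case (desugarIf s) [(k, desugarIf e) | (k, e) <- alts]
desugarIf (If c t e)    = Case (desugarIf c)
                               [("True", desugarIf t), ("False", desugarIf e)]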
Is it possible to encode this kind of expectation into the type system?
Or does GHC just tacitly rely on the fact that source code takes a
predetermined path through the compiler engine? (That works for a
monolithic product like a compiler, but it wouldn't be so hot for a
library that 3rd parties can poke and prod arbitrary data into.)
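One trick that might fit: tag the expression type with a phantom index saying whether the sugar is still allowed, and make the offending constructor a GADT that only exists at the "sugared" index. Reworking the toy Expr from above:

{-# LANGUAGE EmptyDataDecls, GADTs #-}

data WithIf   -- phantom tags: may if/then/else still occur?
data NoIf

data Expr s where
  Var  :: String -> Expr s
  App  :: Expr s -> Expr s -> Expr s
  Case :: Expr s -> [(String, Expr s)] -> Expr s
  If   :: Expr WithIf -> Expr WithIf -> Expr WithIf -> Expr WithIf

-- The type of the pass now guarantees that no If node survives it.
desugarIf :: Expr WithIf -> Expr NoIf
desugarIf (Var x)       = Var x
desugarIf (App f a)     = App (desugarIf f) (desugarIf a)
desugarIf (Case s alts) = Case (desugarIf s) [(k, desugarIf e) | (k, e) <- alts]
desugarIf (If c t e)    = Case (desugarIf c)
                               [("True", desugarIf t), ("False", desugarIf e)]

Scaling that up to dozens of constructs and many passes is presumably where it gets painful.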
I'm basically looking at a problem which is like this. There are
primitive constructs, and there are more sophisticated abstractions
built from these, and I would like to come up with a system where
abstractions are first-class, new abstractions can be added, and for any
particular data structure, you can statically guarantee which
abstractions are or aren't present.
Has anybody ever looked at a general solution to something like that?
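(The nearest thing I can point at is the encoding from Wouter Swierstra's "Data types à la carte" paper: each construct is its own little functor, and a particular data structure is the fixed point of whichever sum of those functors you permit, so the type spells out exactly which constructs can occur. A fragment of the idea, sketch only -- though I'm not sure it covers everything I'm after:

{-# LANGUAGE TypeOperators #-}

newtype Fix f = In (f (Fix f))

-- Combine two sets of constructs into one.
data (f :+: g) e = Inl (f e) | Inr (g e)

-- Two independent "abstractions".
data Val e = Val Int
data Add e = Add e e

-- Types that state which abstractions are present.
type Simple = Fix Val
type Arith  = Fix (Val :+: Add)

onePlusOne :: Arith
onePlusOne = In (Inr (Add (In (Inl (Val 1))) (In (Inl (Val 1)))))
)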
[I'm going to stop talking now...]