[Haskell-cafe] Subsets and supersets

Wed Mar 16 13:13:35 CET 2011

The other day I saw the "green field Haskell" discussion on Reddit. Of 
course, I could give you a very long list (ha!) of things that I would 
change about Haskell. (This would probably begin with expunging 
Monad.fail with extreme prejudice...) But there's one particular feature 
which is non-trivial enough to be worth discussing here...

Haskell has ADTs. Most of the time, these work great. As I've written in 
several other places (but possibly not here), OO languages tend to 
"factor" the problem the other way. That is, if I want a binary tree, an 
OO language makes me split the type and all of its operations into three 
parts (an abstract base class, a branch subclass, and a leaf subclass). 
Adding each new operation requires adding an abstract version of it to 
the abstract base class, and putting half of the implementation into 
each concrete subclass.

Haskell is factored the other way. The definition of the type goes in 
one place. The definition of each operation over it goes in one place. 
More can be added at any time. In particular, the OO approach doesn't 
let you add new tree operations without altering the source code for the 
tree classes. The ADT version *does* let you add new operations - 
provided the tree structure is visible.

This is not to say that the OO approach is wrong, of course. The 
important thing about a binary tree is that there are exactly two kinds 
of nodes - leaves and branches - and there will *always* be exactly two 
kinds. That's what makes it a binary tree! [OK, no it isn't, but bare 
with me.]

If, instead of a binary tree, or an abstract syntax tree, or some other 
fixed, unchanging structure, we wanted to deal with, say, bank 
accounts... Well, there are a bazillion kinds of bank account. And more 
might be added at any time. Here an ADT is utterly the wrong thing to 
do. It would have a bazillion constructors, and all the processing for a 
particular bank account type would be split between dozens of seperate 
functions. By contrast, an OO language would put each account type and 
all its associated processing in the same place - the class definition.

In summary, problems can be factored two ways. ADTs do it one way, 
classes do it the other way, and both can be appropriate. OO languages 
use classes, but, happily, Haskell has ADTs *and* classes! :-D And not 
classes in the same sense as OO classes; the way Haskell does it is 
actually better, IMHO.

Anyway, coming back on-topic... ADTs are great, but sometimes they don't 
let me easily express exactly what I want. Specifically, sometimes I 
have one type which is really a subset of another type, or perhaps a 
superset of several times.

You can build subsets using GADTs, after a fashion:

   data Foo
   data Bar

   data Foobar x where
     Foobar1 ... :: Foobar Foo
     Foobar2 ... :: Foobar Bar
     Foobar3 ... :: Foobar x
     Foobar4 ... :: Foobar Bar
     Foobar5 ... :: Foobar Foo

Now any expression of type Foobar Foo is guaranteed to contain only 
Foobar1, Foobar3 or Foobar5, while any expression of type Foobar Bar is 
guaranteed to contain only Foobar2, Foobar3 or Foobar4. As the tangle of 
sets becomes more complicated, this approach becomes more difficult to 
apply.

You can of course build supersets using an even more mundane Haskell 
construct. The canonical example is Either. If I want a field that holds 
a Foo *or* a Bar, I can trivially write Either Foo Bar.

The trouble is, this entails an extra level of indirection. In order to 
access a Foo or a Bar, I now have to strip off the Left or Right 
constructor to get to it. To construct a value, I have to stick in the 
constructor. Sometimes this is just fine, even desirable. For example, I 
recently wrote a function who's type is

   spans :: (x -> Bool) -> [x] -> [Either [x] [x]]

Here the Left and Right constructors are actually /telling/ me 
something. They convey /information/.

On the other hand, in the case of Either Foo Bar, the constructor tells 
me nothing. Consider:

   data Foo = Foo1 | Foo2 | Foo3
   data Bar = Bar1 | Bar2 | Bar3

   foobar :: Either Foo Bar -> x
   foobar e =
     case e of
       Left  Foo1 -> ...
       Left  Foo2 -> ...
       Left  Foo3 -> ...
       Right Bar1 -> ...
       Right Bar2 -> ...
       Right Bar3 -> ...

There is, logically, no particular reason why we shouldn't simplify the 
function to just

   foobar Foo + Bar -> x
   foobar fb =
     case fb of
       Foo1 -> ...
       Foo2 -> ...
       Foo3 -> ...
       Bar1 -> ...
       Bar2 -> ...
       Bar3 -> ...

The set of possible constructors for Foo and for Bar are disjoint. There 
is no possibility of ambiguity here. The type system itself even /uses/ 
this fact to infer the type of literal constructor applications! And 
yet, you cannot write code such as the above. Not in Haskell, anyway.

So what is the use case here? Well, consider for a moment the problem of 
parsing Haskell source code. Haskell contains both "patters" and 
"expressions". Now actually, it turns out that a pattern and an 
expression have a very similar structure. The set of possible patterns 
is /almost/ a subset of the set of possible expressions. (But not 
quite.) That gives me design choices:

* Design one ADT which encodes both patterns and expressions. Since this 
removes the ability to statically distinguish between them, this is 
almost certainly the wrong choice.

* Design one ADT for patterns and a seperate, unrelated ADT for 
expressions. I have now violated the DRY principle. And, hypothetically, 
if I wanted to treat a pattern as an expression, I would have to write a 
lot of boilerplate code that does nothing more than replace one 
constructor name with another (recursively).

* Do something with GADTs, and hope the type system doesn't explode.

Note that a literal value such as 5 or "foo" is both a valid expression 
and a valid pattern. In fact, I probably want an ADT just for literal 
values (since there's quite a few possible forms, and often I don't care 
which).

The net result of all this is usually something like

   data Literal =
     LInteger Integer |
     LChar    Char    |
     LString  String  |
     ...

   data Pattern =
     PLiteral     Literal        |
     PConstructor Name [Pattern] |
     PVariable    Name           |
     PWildcard                   |
     ...

   data Expression =
     ELiteral     Literal              |
     EConstructor Name [Expression]    |
     EVariable    Name                 |
     ELambda      [Pattern] Expression |
     ...

   expr1 = EConstructor "Foo" [ELiteral (LString "bar")]
   patt1 = PConstructor "Foo" [PLiteral (LString "bar")]

The entire Literal type just looks plain wrong. The set of possible 
Integer values and the set of possible Char values are utterly disjoint. 
Really I ought to be able to just say which types Literal is the union 
of, and not worry about it any further.

(There is the minor detail that the set of Integer values and Double 
values are *not* disjoint. There is also the minor detail that if any 
valid Char is also a valid Literal, then the type of 'x' is now no 
longer unambiguously Char. This makes type inference "interesting".)

The case of patterns and expressions is more interesting, since here we 
have two overlapping sets, yet neither is strictly a superset of the 
other. Today we could probably solve this as just

   data Haskell x where
     HLiteral Literal :: Haskell x
     HConstructor Name [Haskell x] :: Haskell x   -- Is this right?
     HVariable    Name             :: Haskell x
     HWildcard                     :: Haskell Pattern
     HLambda      [Haskell Pattern] (Haskell Expression) :: Haskell 
Expression
     ...

In other words, by carefully applying GADTs, the pattern/expression 
problem is probably solveable today. But it's tricky. And it requires me 
to create two useless nullary types that exist only to let the type 
checker distinguish patterns from expressions. (Fortunately empty data 
declarations can be enabled quite easily.)

It would just be nice if I would trivially and easily take the union of 
several types, and if this was a fundamental operation. Then we would 
probably build type hierachies in a quite different way. Of course, 
there are several "minor details" which would need to be overcome first:

1. The Haskell 2010 type system guarantees that 'x' has type Char. If we 
let people define supersets of types, how is the poor type checker 
supposed to know exactly which type 'x' is meant to belong to?

2. I might not want to explicitly notate every single pattern and 
expression with its exact type, but /the compiled code/ needs to know! 
When a Char value becomes a Literal value, how does a case expression 
handle that? Currently (AFAIK) each constructor in a type is assigned a 
unique ID, and case expressions match against that. But if you take the 
union of two types, now what? Constructor IDs are only unique within a 
single type, not within some arbitrary union of types.

3. It strikes me that the RTS (particularly the GC) is going to need to 
know what the hell we're doing too.

The most obvious solution to #1 is that the type of every term is the 
simplest possible type, unless you try to unify it with some other type, 
at which point the type for both such terms becomes the union of types. 
I rather suspect that this will instantly make every possible program 
well-typed, however. I suppose an alternative is to require explicit 
type signatures, but then we just lost the entire advantage of this 
construction.

As to #2, maybe this whole type union insanity could be considered 
merely syntax sugar, with the whole edifice being desugared into the 
multiply constructor wrapping craziness we have now. But it seems a 
pitty. If we could find a way to assign a globally unique ID to every 
constructor (code address pointer perhaps?), then we could increase 
program efficiency at the same take as making code less verbose.

As to #3... my name is not Simon.