[Haskell-cafe] Modelling structurally subtyped and cyclic data in type safe Haskell

Sat Dec 10 21:02:10 UTC 2016

Dear Aura,

apparently, the data type that you are dealing with is very complex. I 
don't fully understand it, so I do not know what "the best" 
representation as a Haskell data type would be, but maybe I can help 
with some general guidelines and pointers.

First, the most important guideline is probably to remember that a type 
system is way to have someone (the compiler) check your program and 
reject it when it is "obviously" wrong. This is extremely helpful, but 
it is up to you to decide how far you want to take it. Your data type 
comes with many invariants ("The fields of a record are always either 
type declarations or constant declarations, or ..."), but it may not 
necessarily be the best idea to encode *all* of them in the types, it 
might be more elegant to encode only *most* of them in the types.

For instance, you could make a type `Declaration` and use it for both 
`AnyTopLevelDecl` and `StructFieldDecl`. and accept the fact that the 
compiler may not catch all mistakes that you might make here. If you 
need to write less code that way, this may well be worth the trade-off.

Second, it may helpful to look at how others do it. Your writing 
indicates that are already very knowledgeable, so this advice may be not 
very helpful, but for the sake of completeness, I would like to mention 
it anyway. For example, the `haskell-src-exts` package implements a 
syntax tree for Haskell, maybe it can offer an inspiration or two.

   http://hackage.haskell.org/packages/search?terms=haskell-src-exts

Third, some more concrete ideas.

A. When one node type is a subset of another, it may be helpful to 
encode the data types in the same way:

     data TopLevelDecl
         = StructField StructFieldDecl
         | Namespace NamespaceDecl
     data StructFieldDecl
         = Const ConstDecl
         | ...

Then, you can at least reuse the same code for everything that is a 
`StructFieldDecl`.

B. If you don't like that, you can try anonymous sums instead, which are 
built with the type operator (:+:) available in the module 
`GHC.Generics`. It's essentially the same thing as the `Either` type, 
but the syntax is a bit slicker:

    type TopLevelDecl = NamespaceDecl :+: ConstDecl :+: ...
    type StructFieldDecl = ConstDecl :+: ...

See also the paper "Data Types à la Carte"

   http://www.cs.ru.nl/~W.Swierstra/Publications/DataTypesALaCarte.pdf

for more.

C. Phantom types (and, more generally, GADTs) are your friend, and you 
have already figured this out yourself. Phantom types can also help you 
with the unique identifiers. Instead of using a plain type

     type Key = Int           -- Id

you can write

     newtype Key a = K Int    -- Id that points to a value of type a

and use

     data FunctionDecl = FunctionDecl
         { parameters :: [Key Type]
         }

to restrict the type of things that the key can refer to. (The argument 
to the `Key` type is called a "phantom", because the representation of 
the Key does not care about it.)

Of course, you would then need one map for each node type that you want 
to have keys for. If that seems to onerous, there is also the option to 
use a "heterogeneous map", e.g. from my package

   http://hackage.haskell.org/package/vault

It gives you a `Key` type constructor and a `Vault` type, which contains 
the corresponding key-value pairs. It's a bit like `STRef`, but pure.

I hope this helps!

Best regards,
Heinrich Apfelmus

PS: Your project seems quite valuable (and ambitious). The Haskell 
ecosystem features nice tools for generating FFI imports from C, but C++ 
support has been sorely lacking, if I remember correctly. There are only 
few offerings on hackage:

   http://hackage.haskell.org/packages/#cat:FFI%20Tools

--
http://apfelmus.nfshost.com

Aura Kelloniemi wrote:
> Hello
> 
> This is my first post to Haskell Café. I am a hobbyist programmer who has lots
> of experience with imperative and OO languages ranging from 80286 assembly to
> Ada and Ruby. Last four years I've been learning Haskell, which has been a
> painfully educational experience - I have a good understanding about how CPUs
> work, but very little understanding about maths. This is a very long post, I apologize.
> 
> Now I'm trying to create a tool which automatically generates foreign function
> interface (FFI) imports from C and C++ to Haskell. I have written a GCC plugin
> which extracts all relevant (and lots of irrelevant) information from GCC's
> TREEs and dumps the output as JSON. It also dumps preprocessor macros which is
> one of the reasons this tol is being given birth.
> 
> The tool is used like Shake - import Development.Familiar and write code which
> glues all relevant data processing and FFI generation functions together.
> 
> This is also going to be a reusable library which others can use for whatever
> they find the data to be useful for. This could include making FFI bindings to
> other languages. Some code analysis tools also might find the data useful.
> 
> I'm having a hard time importing this data to Haskell in a type safe way.
> GCC's TREEs are structured in an OO way. I have tried to find a more
> functional way of representing them, but so far with no luck.
> 
> The JSON data is typed, and the hierarchy is exactly the same as in GCC. Below
> I construct a tree of the hierarchy where I list a few of possible types and
> very few of the fields the data types contain. For listing fields and their
> types I use a syntax similar to Haskell. This shortened listing should be
> enough for illustration:
> 
> * macro: (data fields include: name :: string, tokens :: [token], etc.)
> * tree: all nodes have a unique ID to be able to recover the cycles
>     among the node graph; (id :: integer)
> * * translation_unit: (declarations :: [top_level_declaration])
> * * constant: a compile-time constant value; (type :: type)
> * * * complex_constant: (value :: complex)
> * * * integer_constant: (value :: integer)
> * * namespace_decl: (name :: identifier,
>                      declarations :: [top_level_declaration])
> * * declaration: (name :: identifier, type :: type)
> * * * const_decl: One component of enum type; (value :: type)
> * * * function_decl: (type :: function_type, params :: [param_decl],
>                       result :: result_decl)
> * * * type_decl: (no additional fields)
> * * type: (declaration :: type_decl, name :: identifier,
>            size :: integer_constant, alignment :: integer,
>            qualifiers :: [qualifier], completeness :: bool)
> * * * array_type: (element_type :: type, element_count :: integer)
> * * * numeric_type: (precision :: integer, signed :: bool,
>                      min_value :: integer_constant,
>                      max_value :: integer_constant)
> * * * * enumeral_type: (values :: [const_decl])
> * * * function_type: (param_types :: [type], result_type :: type)
> * * * pointer_type: (referred_type :: type)
> * * * record_type: struct (C or C++)/union;
>         (base_types :: [(type, access_info)],
>          fields :: [field_decl or const_decl or type_decl],
>          methods :: [method_decl])
> * reference: Any node may be replaced by a reference. They can be
>     forward or backward references. (referred_id :: integer)
> 
> As you may see, tree nodes form many many cycles together. Subclasses
> sometimes provide stronger invariants for their fields than the parent
> classes. For example all declarations are associated with a type and for
> function_decl nodes this type is always a function_type.
> 
> Now to my question. How to represent this in Haskell. Because of the enormous
> amount of fields per tree node, I certainly need to use records.
> 
> -- Haskell follows:
> data BaseTree = BaseTree { id :: Int, ... ... }
> data BaseDecl = BaseDecl { baseTree :: BaseTree, name :: Identifier,
>                            declType :: Type, ... }
> data ConstDecl = ConstDecl { baseDecl :: BaseDecl, value :: ConstValue }
> data ConstValue = ComplexConst Complex | IntegerConst Integer
> data TypeDecl = TypeDecl { baseDecl :: BaseDecl }
> 
> But... What shall I do about the heterogenous lists which many nodes contain.
> I wish I don't need to make myriards of ADTs for them, like:
> 
> data AnyTopLevelDecl = ATLDConst ConstDecl
>                      | ATLDFunction FunctionDecl
>                      | ATLDNamespace NamespaceDecl
>                      | ATLDType TypeDecl -- ...
> data StructFieldDecl = SFDConst ConstDecl
>                      | SFDField FieldDecl
>                      | SFDType TypeDecl
> 
> And so forth, basicly each heterogenous list gets its own ADT, because there
> is often variance in which node types can appear in the lists.
> 
> I could use one big ADT like AnyDecl. However, to me it would feel like a bad
> design, if my types allowed RecordType nodes to reference NamespaceDecls, and
> therefore I don't want to pack all nodes into one type, unless I could use
> some type parameters or constraints to restrict the possible values the ADT
> could take.
> 
> At first I thought that the following could solve my problem:
> 
> data SomeDecl a = SomeDecl where
>     SomeConst :: ConstDecl -> SomeDecl ConstDecl
>     SomeField :: FieldDecl ->SomeDecl FieldDecl
>     SomeFunction :: FunctionDecl -> SomeDecl FunctionDecl
>     SomeType :: TypeDecl -> SomeDecl TypeDecl
> 
> -- and then I would use type classes like this
> class TopLevelDecl decl where
>     -- No methods needed
> instance TopLevelDecl ConstDecl
> instance TopLevelDecl FunctionDecl
> instance TopLevelDecl TypeDecl
> 
> -- And then I could write functions like this:
> handleTopLevelDecl :: TopLevelDecl d => SomeDecl d -> Whatever
> handleTopLevelDecl (FunctionDecl f) = -- ...
> -- and I'd list all cases which have a TopLevelDecl instance, but not
> -- those which don't have. I also thought that my data types could look like
> -- this:
> 
> data NamespaceDecl where
>     NamespaceDecl :: {
>         baseTree :: BaseTree,
>         forall a. TopLevelDecl a => declarations :: [a]
>         }
> 
> Obviously this does not work, because type classes are open, and even though
> GHC prevents me from calling handleTopLevelDecl with (SOmeDecl FieldDecl), it
> still complains if I don't define an equation for it, because in some other
> module somebody might make a TopLevelDecl instance for FieldDecl.
> 
> Now I would like to know if there is a way to solve my problem in an elegant
> way, e.g. by using type families.
> 
> My second question is how to resolve the cyclicity of the node graph. I don't
> want to tie the knot, because I want to be able to manipulate the graph.
> 
> Maps are one option, yes:
> 
> data AnyNode = (list all possible nodes)
> type Id = Int
> type NodeIdMap = Map Id AnyNode
> 
> -- Then my data types refer to other nodes like this:
> data FunctionDecl = FunctionDecl {
>     parameters :: [Id]
>     }
> 
> Now I lose all type safety. I would have to insert a run-time check to every
> dereference of an Id if I want to make sure that function parameter
> declarations don't contain StringConstants.
> 
> I've also considered using STRefs. During JSON parsing I would need to
> populate the nodes with the node Ids they refer to and then have a separate
> pass which turns all Ids to typed STRefs. At least with this approach I would
> get all possible run-time errors straight after parsing, and not when the data
> is used.
> 
> I'm again sorry for this long elaboration. However I think that if you
> consider answering me you have a pretty good understanding of my problem and
> also a good knowledge of my level of understanding Haskell. I would like to
> unlearn OO-style thinking, and I would want to find a data and type
> representation which feels Haskellish. However I don't want to cut down the
> data. I want all information to be accessible in case somebody needs it.
> 
> Thank you in advance!
>