[Haskell-cafe] Designing DSL with explicit sharing [was: I love purity, but it's killing me]

Wed Feb 13 04:33:19 EST 2008

Tom Hawkins  wrote:
] My DSLs invariably define a datatype to capture expressions; something
] like this:
]
] data Expression
]   = Add Expression Expression
]   | Sub Expression Expression
]   | Variable String
]   | Constant Int
]   deriving Eq
] The problem comes when I want to generate efficient code from an
] Expression (ie. to C or some other target language).  The method I use
] invovles converting the tree of subexpressions into an acyclic graphic
] to eliminate common subexpressions.  The nodes are then topologically
] ordered and assigned an instruction, or statement for each node.  For
] example:
]
] let a = Add (Constant 10) (Variable "i1")
]     b = Sub (Variable "i2") (Constant 2)
]     c = Add a b
] The process of converting an expression tree to a graph uses either Eq
] or Ord (either derived or a custom instance) to search and build a set
] of unique nodes to be ordered for execution.  In this case "a", then
] "b", then "c".  The problem is expressions often have shared,
] equivalent subnodes, which dramatically grows the size of the tree.
] For example:
]
] let d = Add c c
]     e = Add d d    -- "e" now as 16 leaf nodes.
]
] As these trees grow in size, the equality comparison in graph
] construction quickly becomes the bottleneck for DSL compilation.

We show the design of the same sort of DSL that explicitly maintains
sharing information and where node comparisons are quick, because we
are comparing hashes rather than trees themselves. Our approach is
assuredly safe, pure, and Haskell98 (save for disabling of the
monomorphism restriction, which is done solely to avoid writing
signatures). No GHC-specific behavior is relied upon.

> {-# OPTIONS_GHC -fno-monomorphism-restriction #-}
>
> module DSL where
>
> import Data.IntMap as IM
> import Control.Monad.State

The approach is based on the final tagless representation. Here is our
DSL:

> class Exp repr where
>    constant :: Int -> repr Int
>    variable :: String -> repr Int
>    add :: repr Int -> repr Int -> repr Int
>    sub :: repr Int -> repr Int -> repr Int

Tom Hawkins' test expressions now look as follows

> a = add (constant 10) (variable "i1")
> b = sub (variable "i2") (constant 2)
> c = add a b
> d = add c c
> e = add d d

which is the same as before modulo the case of the identifiers:
everything is in lower case.

We can show the expressions as before: showing is one way
of evaluating things

> newtype S t = S{unS :: String}
>
> instance Exp S where
>     constant x = S $ show x
>     variable x = S x
>     add e1 e2 = S( unS e1 ++ " + " ++ unS e2)
>     sub e1 e2 = S( unS e1 ++ " - " ++ unS e2)
>
> test_showe = unS e

*DSL> test_showe
"10 + i1 + i2 - 2 + 10 + i1 + i2 - 2 + 10 + i1 + i2 - 2 + 10 + i1 + i2 - 2"

We can write an evaluator for the expressions

> type REnv = [(String,Int)]
> newtype R t = R{unR :: REnv -> t} -- A reader Monad, actually
>
> instance Exp R where
>     constant x = R $ const x
>     variable x = R ( \env -> maybe (error $ "no var: " ++ x) id $ 
> 		     Prelude.lookup x env)
>     add e1 e2 = R(\env -> unR e1 env + unR e2 env)
>     sub e1 e2 = R(\env -> unR e1 env - unR e2 env)

> test_vale = unR e [("i1",5),("i2",10)] -- 92

We stress: we are using exactly the same expression e as before. 
We are only evaluating it differently. The gist of the final tagless
approach is to write an expression once and evaluate it many times.

Now, we chose a different representation: to make sharing explicit
We chose not to rely on GHC; we don't care if in (add c c), the two
c are shared or copied. It is not observable in pure Haskell,
and we don't care. We build our acyclic graph nevertheless.

> type ExpHash = Int

We stress: ACode is NOT a recursive data structure, so the comparison
of ACode values takes constant time!

> data ACode = AConst ExpHash |
> 	       AVar   ExpHash |
> 	       AAdd   ExpHash |
> 	       ASub   ExpHash
> 	     deriving (Eq,Show)

> data ExpMaps = ExpMaps{ hashcnt :: ExpHash, -- to generate new Hash
> 			  ctmap :: IntMap Int,
> 			  vrmap :: IntMap String,
> 			  admap :: IntMap (ACode,ACode),
> 			  sumap :: IntMap (ACode,ACode)}
> 	     deriving Show
> exmap0 = ExpMaps 0 IM.empty IM.empty IM.empty IM.empty
>
>
> newtype A t = A{unA :: State ExpMaps ACode}

Granted, the following could be done far more efficiently: we need
bimaps.

> loookupv :: Eq v => v -> IntMap v -> Maybe Int
> loookupv v = IM.foldWithKey 
> 	     (\k e z -> maybe (if e == v then Just k else Nothing) (const z) z) 
> 	     Nothing
>

> record con prj upd x = do
>   s <- get
>   maybe (do let s' = upd (s{hashcnt = succ (hashcnt s)})
> 		           (IM.insert (hashcnt s) x (prj s))
> 	      put s'
> 	      return (con $ hashcnt s))
>         (return . con) $ loookupv x (prj s)

> instance Exp A where
>     constant x = A(record AConst ctmap (\s e -> s{ctmap = e}) x)
>     variable x = A(record AVar   vrmap (\s e -> s{vrmap = e}) x)
>     add e1 e2  = A(do
> 		     h1 <- unA e1
> 		     h2 <- unA e2
> 		     record AAdd admap (\s e -> s{admap = e}) (h1,h2))
>     sub e1 e2  = A(do
> 		     h1 <- unA e1
> 		     h2 <- unA e2
> 		     record ASub sumap (\s e -> s{sumap = e}) (h1,h2))

Again, we are using the very same expression e we wrote at the very
beginning:

> test_sme = runState (unA e) exmap0

*DSL> test_sme
(AAdd 8,
  ExpMaps {hashcnt = 9, 
    ctmap = fromList [(0,10),(4,2)], 
    vrmap = fromList [(1,"i1"),(3,"i2")], 
    admap = fromList [(2,(AConst 0,AVar 1)),(6,(AAdd 2,ASub 5)),
                      (7,(AAdd 6,AAdd 6)),(8,(AAdd 7,AAdd 7))], 
    sumap = fromList [(5,(AVar 3,AConst 4))]})

We retain all the information about expression 'e'. In addition, all
sharing is fully explicit. As we can see, the evaluation process finds
common subexpressions automatically.